[PyTorch] Improve conversion from/to bool on aarch64+sve #166330
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166330
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit bb6f1bf with merge base 2dc5645.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing" "release notes: cpu (aarch64)"
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
We are adding autovec routines to convert to and from boolean values.
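As a rough illustration of what such routines might look like, here is a minimal sketch of conversion loops written so that a compiler can auto-vectorize them. The names `convert_from_bool` and `convert_to_bool` are hypothetical and are not taken from this PR; the actual routines in the change may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (name not from the PR): bool -> T.
// true becomes T(1), false becomes T(0). The loop body has no branches and
// no cross-iteration dependencies, which leaves it open to auto-vectorization.
template <typename T>
void convert_from_bool(const bool* src, T* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = static_cast<T>(src[i]);
  }
}

// Hypothetical helper (name not from the PR): T -> bool.
// Any nonzero value becomes true, zero becomes false.
template <typename T>
void convert_to_bool(const T* src, bool* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = (src[i] != T(0));
  }
}
```

A round trip such as bool->uint8->bool would call `convert_from_bool<std::uint8_t>` followed by `convert_to_bool<std::uint8_t>` and should reproduce the input. Whether a given compiler actually emits SVE code for these loops depends on the target flags and optimization level.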
We observed the following performance improvements when compiling for the armv9-a+sve2+fp16+bf16 target:
before:
bool->uint8->bool    ===> 447.854us
bool->int8->bool     ===> 445.609us
bool->int16->bool    ===> 312.425us
bool->int32->bool    ===> 324.368us
bool->float->bool    ===> 320.929us
bool->float16->bool  ===> 290.825us
bool->bfloat16->bool ===> 437.250us
after:
bool->uint8->bool    ===>  78.988us ----> 467% higher throughput
bool->int8->bool     ===>  78.494us ----> 468% higher throughput
bool->int16->bool    ===> 107.993us ----> 189% higher throughput
bool->int32->bool    ===> 186.887us ---->  74% higher throughput
bool->float->bool    ===> 188.048us ---->  71% higher throughput
bool->float16->bool  ===> 102.789us ----> 183% higher throughput
bool->bfloat16->bool ===> 105.809us ----> 313% higher throughput
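The "higher throughput" percentages above are consistent with computing (before / after - 1) * 100 for a fixed workload and rounding to the nearest integer. A small sketch of that arithmetic, using the timings listed above; the function and struct names are illustrative only:

```cpp
#include <cmath>
#include <cstdio>

// Throughput gain in percent for a fixed amount of work:
// throughput is proportional to 1 / time, so gain = (before / after - 1) * 100.
double throughput_gain_percent(double before_us, double after_us) {
  return (before_us / after_us - 1.0) * 100.0;
}

// Timings copied from the before/after lists in the summary.
struct Timing {
  const char* name;
  double before_us;
  double after_us;
};

constexpr Timing kTimings[] = {
    {"bool->uint8->bool", 447.854, 78.988},
    {"bool->int8->bool", 445.609, 78.494},
    {"bool->int16->bool", 312.425, 107.993},
    {"bool->int32->bool", 324.368, 186.887},
    {"bool->float->bool", 320.929, 188.048},
    {"bool->float16->bool", 290.825, 102.789},
    {"bool->bfloat16->bool", 437.250, 105.809},
};

// Prints one line per conversion with the rounded gain.
void print_gains() {
  for (const Timing& t : kTimings) {
    std::printf("%-22s %ld%% higher throughput\n", t.name,
                std::lround(throughput_gain_percent(t.before_us, t.after_us)));
  }
}
```

For example, 447.854 / 78.988 is about 5.67, which corresponds to roughly 467% higher throughput for the uint8 round trip.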
Test Plan:
Correctness:
buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch
Performance:
buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test
Reviewed By: mcfi
Differential Revision: D85533284
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168