[pytorch][perf] add mobile friendly at:parallel_for backend #26702
Conversation
Summary: This diff implemented at::parallel_for() / parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submitted in one batch, which locks the thread pool until all of them finish. As a result we didn't wrap caffe2::ThreadPool with the TaskThreadPoolBase interface and reuse the at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch the inter-op pool - it's still the default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is an early draft to receive feedback. Will do more thorough tests. [ghstack-poisoned]
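For context, a minimal sketch of how a kernel calls this API (the function and grain size here are illustrative, not from the PR):
```
#include <ATen/Parallel.h>

// Hypothetical element-wise kernel. Each [begin, end) chunk becomes one task;
// on mobile builds the whole batch is handed to caffe2::ThreadPool at once.
void scale_inplace(float* data, int64_t n, float factor) {
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= factor;
    }
  });
}
```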
aten/src/ATen/ParallelNativeMobile.h
Outdated
```
internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
std::vector<scalar_t> results(num_tasks);
scalar_t* results_data = results.data();
parallel_for(
```
Here I diverged from ParallelNative.h - I used parallel_for() + get_thread_num() to implement it in order to reduce the amount of complicated logic in this PR. Let me know if you think I should take the original approach.
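A sketch of that approach, assuming the internal:: helpers visible in the snippet above (the function name is illustrative, not the actual PR code):
```
// Sketch: build the reduction on top of parallel_for by giving each task its
// own slot in `results`, then combining the partials serially afterwards.
template <class scalar_t, class F, class SF>
scalar_t parallel_reduce_sketch(
    int64_t begin, int64_t end, int64_t grain_size,
    scalar_t ident, const F& f, const SF& sf) {
  size_t num_tasks, chunk_size;
  std::tie(num_tasks, chunk_size) =
      internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
  std::vector<scalar_t> results(num_tasks, ident);
  scalar_t* results_data = results.data();
  parallel_for(begin, end, grain_size,
      [f, ident, results_data](int64_t local_begin, int64_t local_end) {
    // Inside the parallel region, get_thread_num() identifies this task's slot.
    results_data[get_thread_num()] = f(local_begin, local_end, ident);
  });
  scalar_t result = ident;
  for (size_t i = 0; i < num_tasks; ++i) {
    result = sf(result, results_data[i]);
  }
  return result;
}
```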
I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?
We probably can merge ParallelNative.h and ParallelNativeMobile.h - just implement the internal::_run_with_pool API with c10::ThreadPool there. The new implementation doesn't expose internal APIs like internal::_get_intraop_pool / internal::_set_thread_num / etc. in the header file.
We probably can keep separate cpp files though, as each function is slightly different.
@ilia-cher please let me know your thoughts.
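Roughly what that unification could look like (a sketch; `mobile_threadpool()` stands in for whatever accessor exposes the caffe2 pool):
```
namespace internal {
// Single dispatch point: only the backing pool differs between builds.
inline void _run_with_pool(
    const std::function<void(int, size_t)>& fn, size_t range) {
#ifdef C10_MOBILE
  // caffe2::ThreadPool runs the whole batch and blocks until it finishes.
  caffe2::mobile_threadpool()->run(fn, range);
#else
  // c10 intra-op pool: submit tasks 1..range-1 asynchronously, run task 0 inline.
  for (size_t i = 1; i < range; ++i) {
    _get_intraop_pool().run([fn, i]() { fn((int)i, i); });
  }
  fn(0, 0);
#endif
}
} // namespace internal
```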
feel free to refactor internal namespace functions (y)
This is great! Awesome work.
Looks good to me, I'll let Ilia double-check.
aten/src/ATen/ParallelNativeMobile.h
Outdated
```
  }
  futures[task_id].markCompleted();
};
internal::_run_with_pool(task, num_tasks);
```
we don't utilize the main thread to do some compute, do we?
caffe2's thread pool (a gemmlowp clone) runs one task (the first one) directly on the current thread: https://github.com/pytorch/pytorch/blob/master/caffe2/utils/threadpool/WorkersPool.h#L345
That's one of the reasons I decided to fork and modify ParallelNative.h (which also runs on the main thread explicitly) - doing it in both places is wrong.
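In other words, the hazard being avoided looks like this (purely illustrative pseudocode; `submit_to_worker` and `wait_for_workers` are hypothetical, not caffe2's actual API):
```
// Batch runner in the style of the gemmlowp-derived pool: if the caller
// *also* ran the first chunk itself, chunk 0 would execute twice.
void run_batch(const std::vector<std::function<void()>>& tasks) {
  for (size_t i = 1; i < tasks.size(); ++i) {
    submit_to_worker(tasks[i]);  // hypothetical: hand off to a worker thread
  }
  tasks[0]();                    // first task runs on the calling thread
  wait_for_workers();            // hypothetical barrier until the batch ends
}
```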
```
@@ -89,12 +89,19 @@ std::string get_parallel_info() {
}
```
nit: could you add some info into get_parallel_info, e.g. whether we're using a mobile thread pool
Sure, will append "[mobile]" to the ATen parallel backend string for mobile builds.
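E.g. something along these lines in the info string builder (exact wording illustrative):
```
std::ostringstream ss;
ss << "ATen parallel backend: native thread pool";
#ifdef C10_MOBILE
ss << " [mobile]";  // assumption: marker string chosen for mobile builds
#endif
ss << std::endl;
```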
```
if (pool) {
  pool->run(fn, range);
} else {
  for (size_t i = 0; i < range; ++i) {
```
we might want to add some warning logging here
(although, I'm less sure about details of mobile logging - up to you)
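E.g. a sketch of that fallback with a warning (TORCH_WARN stands in for whatever logging is appropriate on mobile):
```
if (pool) {
  pool->run(fn, range);
} else {
  // Pool unavailable (e.g. not initialized yet): fall back to serial execution.
  TORCH_WARN("thread pool is unavailable, running ", range, " tasks inline");
  for (size_t i = 0; i < range; ++i) {
    fn((int)i, i);
  }
}
```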
```
}

int get_thread_num() {
  return thread_num_;
}

bool in_parallel_region() {
#ifndef C10_MOBILE
  return in_parallel_region_ || (
      num_intraop_threads.load() == CONSUMED &&
      internal::_get_intraop_pool().inThreadPool()
  );
```
we could just unify this as return in_parallel_region_
I didn't change it because intraop_launch() / intraop_launch_future() don't set in_parallel_region_ - technically those callback functions could call parallel APIs too. We don't implement intraop_launch() for mobile (yet) so it's not an issue for mobile.
```
#else
  // TODO: caffe2::ThreadPool doesn't support submitting tasks separately and
  // running in parallel. Should fix it when this API becomes popular.
  func();
```
maybe also throw?
also same for intraop_launch_future
The reason I don't throw for these APIs is that they are actually called from some unit tests - and we don't necessarily want to block other engineers from using these APIs simply because mobile doesn't support them yet. We could work on it separately as fixing "perf issues" instead of crashes...
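For reference, the inline fallback for the future variant would look roughly like this (a sketch using the Future calls visible elsewhere in this PR):
```
std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
    std::function<void()> func) {
  auto future = std::make_shared<c10::ivalue::Future>(c10::NoneType::get());
  // TODO: run on the mobile thread pool once individual task submission works.
  func();                    // executed inline on the calling thread for now
  future->markCompleted();   // callers can still wait on the future uniformly
  return future;
}
```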
aten/src/ATen/ParallelNative.h
Outdated
```
  futures[i] = std::make_shared<c10::ivalue::Future>(NoneType::get());
}
auto task = [f, &eptr, &err_flag, &futures, begin, end, chunk_size]
    (int idx, size_t task_id) {
```
nit: we don't seem to be using idx?
We have idx because the caffe2 thread pool API expects this signature. We could create another lambda wrapper to hide it but it feels a bit overkill... I can rename it to "_" if you prefer.
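Concretely, the adapter in question looks roughly like this (a sketch based on the snippet above; the chunk math mirrors calc_num_tasks_and_chunk_size):
```
// caffe2::ThreadPool::run invokes fn(int idx, size_t task_id); the body only
// needs task_id to locate its chunk, so idx is accepted but ignored.
auto task = [f, begin, end, chunk_size](int /* idx */, size_t task_id) {
  int64_t local_start = begin + task_id * chunk_size;
  if (local_start < end) {
    int64_t local_end = std::min(end, (int64_t)(chunk_size + local_start));
    f(local_start, local_end);  // user lambda gets its [begin, end) sub-range
  }
};
```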
LG, please also test the native backend version (on server) with
```
ATEN_THREADING=NATIVE BLAS=MKL USE_MKLDNN=1 python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
```
and also test_jit
Landed on master
Summary:
Pull Request resolved: #26969

The template code got inlined into many places, inflating binary size. This PR extracted out the common implementation that doesn't depend on the template parameter.

After:
- Compressed ARMv7 AAR size: 5,677,469 -> 5,398,011
- RAW libpytorch.so size: 16,862,108 -> 16,047,004

Test Plan:
- Test perf/correctness as in #26702;
- Run tests for non-mobile native aten_threading:
```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```

Differential Revision: D17628089
Pulled By: ljk53
fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66
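A sketch of the size optimization this commit describes. The names and the serial fallback are illustrative, not the actual ATen code: the template-free core is compiled once in a .cpp translation unit, and only a thin adapter remains in the header template.

```cpp
#include <cstdint>
#include <functional>

namespace internal {
// In the real code this lives in a .cpp file: because it is not a template,
// the range-splitting/dispatch logic is compiled exactly once instead of
// being stamped out at every parallel_for() call site.
inline void _parallel_run(
    int64_t begin, int64_t end, int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f) {
  (void)grain_size;  // a real implementation would chunk by grain_size
  f(begin, end);     // serial fallback shown here for brevity
}
} // namespace internal

// Header-visible template: only this small wrapper is instantiated per
// lambda type F, keeping per-instantiation code size small.
template <class F>
inline void parallel_for(
    int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  internal::_parallel_run(begin, end, grain_size, f);
}
```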
Stack from ghstack:
Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.
caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks must be submitted in
one batch, which locks the thread pool until all of them finish. As a
result, we didn't wrap caffe2::ThreadPool with the TaskThreadPoolBase
interface and reuse the at::parallel_for() implementation in
ParallelNative.h. Because of this constraint, intraop_launch() /
intraop_launch_future() are not supported yet.
This diff doesn't touch the inter-op pool - it still uses the default
native c10 thread pool. We will work on it when it's more widely used.
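For illustration, a minimal sketch of the call pattern these APIs support. The kernel bodies and grain sizes are hypothetical; the signatures follow ATen/Parallel.h:

```cpp
#include <cstdint>
#include <functional>
#include <ATen/Parallel.h>

// Hypothetical elementwise kernel: the lambda receives a [begin, end)
// sub-range; on mobile all sub-ranges are submitted to caffe2::ThreadPool
// as a single batch of tasks.
void scale_(float* data, int64_t n, float alpha) {
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= alpha;
    }
  });
}

// Hypothetical reduction: each task folds its sub-range into a partial
// result seeded with `ident`; partial results are combined with std::plus.
float sum(const float* data, int64_t n) {
  return at::parallel_reduce(
      0, n, /*grain_size=*/2048, /*ident=*/0.0f,
      [&](int64_t begin, int64_t end, float acc) {
        for (int64_t i = begin; i < end; ++i) {
          acc += data[i];
        }
        return acc;
      },
      std::plus<float>());
}
```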
Test Plan:
Patch PR #26556 ([pytorch][perf] Use Caffe2's implementation of grouped depthwise 3x3 convolutions) and use at::parallel_for() to run the depthwise-3x3-winograd kernel across groups.
Measured the time taken by each convolution layer in MobileNetV2 with the following combinations:
* before this PR, single-threaded (caffe2_threadpool_force_inline=true);
* before this PR, multi-threaded;
* after this PR, single-threaded;
* after this PR, multi-threaded;
The data confirmed that:
* at::parallel_for() accelerated inference in multi-threaded mode (it was only integrated for the winograd kernel);
* at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before vs. after);
* 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases;
* depthwise-strided kernels didn't use the winograd kernel - multi-threaded mode was still slower;
Verified the inference output was the same across all combinations.
Run tests for non-mobile native aten_threading:
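```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```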
Differential Revision: D17543412