[pytorch][perf] add mobile friendly at:parallel_for backend #26702

ljk53 · 2019-09-24T02:08:10Z

Stack from ghstack:

[pytorch][mobile] remove backward functions from jit-op-registry for mobile build #26783 [pytorch][mobile] remove backward functions from jit-op-registry for mobile build
[pytorch][perf] add mobile friendly at:parallel_for backend #26702 [pytorch][perf] add mobile friendly at:parallel_for backend

Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks need to be submit in
one batch which will lock the thread pool until all of them finish - as a
result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface
and reuse at::parallel_for() implementation in ParallelNative.h. Because
of this constraint, intraop_launch() / intraop_launch_future() are not
supported yet.

This diff doesn't touch inter-ops pool - it's still default native c10
thread pool. Will work on it when it's widely used.

Test Plan:

Patch PR [pytorch][perf] Use Caffe2's implementation of grouped depthwise 3x3 convolutions #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups.
Measured time taken by each convolution layer in MobileNetV2 with the follow combinations:
** before this PR, single-threaded (caffe2_threadpool_force_inline=true);
** before this PR, multi-threaded;
** after this PR, single-threaded;
** after this PR, multi-threaded;

+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|       Input       |      Kernel       | Groups | Stride | Wino- | Before | Before | After  | After  |
|                   |                   |        |        | grad? | Single | Multi  | Single | Multi  |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
| [1, 3, 224, 224]  | [32, 3, 3, 3]     |      1 |      2 |       |   7270 |  10816 |   7471 |  11071 |
| [1, 32, 112, 112] | [32, 1, 3, 3]     |     32 |      1 | Yes   |   7563 |  11300 |   8074 |   3153 |
| [1, 32, 112, 112] | [16, 32, 1, 1]    |      1 |      1 |       |   4347 |   2274 |   4348 |   4441 |
| [1, 16, 112, 112] | [96, 16, 1, 1]    |      1 |      1 |       |  14352 |  10947 |  14221 |  12739 |
| [1, 96, 112, 112] | [96, 1, 3, 3]     |     96 |      2 |       |  20202 |  30038 |  22218 |  33296 |
| [1, 96, 56, 56]   | [24, 96, 1, 1]    |      1 |      1 |       |   4620 |   5485 |   4595 |   7585 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7526 |   7909 |   7538 |   3772 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      1 | Yes   |   8870 |  13318 |   9295 |   7035 |
| [1, 144, 56, 56]  | [24, 144, 1, 1]   |      1 |      1 |       |   6823 |   6430 |   6795 |   7067 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7561 |  12839 |   7528 |   9646 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      2 |       |   8879 |  13371 |   9774 |  14753 |
| [1, 144, 28, 28]  | [32, 144, 1, 1]   |      1 |      1 |       |   2297 |   1408 |   2282 |   1926 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3287 |   7740 |   3256 |   1873 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3245 |   4892 |   3294 |   1440 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3032 |   4494 |   3010 |   6376 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   3227 |   3277 |   1780 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3237 |   4912 |   3284 |   1500 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3060 |   3538 |   3008 |   5658 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   6590 |   3281 |   1466 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      2 |       |   6202 |   9395 |   6580 |  10155 |
| [1, 192, 14, 14]  | [64, 192, 1, 1]   |      1 |      1 |       |   1568 |   2030 |   1562 |   1560 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3188 |   5217 |   3168 |   1501 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1889 |   2867 |   1997 |    776 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3053 |   1441 |   3053 |   1547 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3182 |   1188 |   3183 |   1512 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1907 |   2860 |   1992 |    794 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3046 |   1168 |   3037 |   1474 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3225 |   1190 |   3186 |   1599 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1899 |   2856 |   2014 |    773 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3042 |   1169 |   3050 |   1494 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3209 |   1163 |   3191 |   1507 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1879 |   2832 |   1994 |    774 |
| [1, 384, 14, 14]  | [96, 384, 1, 1]   |      1 |      1 |       |   4596 |   2023 |   4606 |   2310 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7081 |   2429 |   7012 |   3086 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2828 |   4269 |   3010 |   1145 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   6989 |   2897 |   6973 |   3491 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+

The data confirmed that:
** at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel);
** at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after);
** 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases;
** depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower;
Verified the inference output was the same across all combinations.
Run tests for non-mobile native aten_threading:

ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py

Differential Revision: D17543412

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in on batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in on batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. ghstack-source-id: 1a31834 Pull Request resolved: #26702

aten/src/ATen/core/ivalue_inl.h

ljk53 · 2019-09-24T02:11:59Z

aten/src/ATen/ParallelNativeMobile.h

+        internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
+    std::vector<scalar_t> results(num_tasks);
+    scalar_t* results_data = results.data();
+    parallel_for(


Here I diverged from ParallelNative.h - I used parallel_for() + get_thread_num() to implement it in order to reduce the amount of complicated logic in this PR. Let me know if you think I should take the original approach.

I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?

I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?

We probably can merge ParallelNative.h and ParallelNativeMobile.h - just implement internal::_run_with_pool API with c10::ThreadPool there. The new implementation doesn't expose those internal APIs like internal::_get_intraop_pool / internal::_set_thread_num / etc in header file.

We probably can keep separate cpp files though as each function is slightly different.

@ilia-cher please let me know your thoughts.

feel free to refactor internal namespace functions (y)

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. ghstack-source-id: 6615f7c Pull Request resolved: #26702

AshkanAliabadi

This is great! Awesome work.

aten/src/ATen/ParallelNativeMobile.h

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - This is early draft to receive feedback. Will do more thorough tests. ghstack-source-id: 6615f7c Pull Request resolved: #26702

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: * before this PR, single-threaded (caffe2_threadpool_force_inline=true); * before this PR, multi-threaded; * after this PR, single-threaded; * after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

dzhulgakov

Looks good to me, I'll let Ilia to double-check

aten/src/ATen/ParallelCommon.cpp

dzhulgakov · 2019-09-25T04:01:54Z

aten/src/ATen/ParallelNativeMobile.h

+      }
+      futures[task_id].markCompleted();
+    };
+    internal::_run_with_pool(task, num_tasks);


we don't utilize the main thread to do some compute, do we?

caffe2's thread pool (gemmlowp clone) runs one task (first one) on current thread directly: https://github.com/pytorch/pytorch/blob/master/caffe2/utils/threadpool/WorkersPool.h#L345

That's one of the reasons I decided to fork and modify ParallelNative.h (which also runs on main thread explicitly) - doing it at both places is wrong.

dzhulgakov · 2019-09-25T04:03:31Z

aten/src/ATen/ParallelNativeMobile.h

+        internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
+    std::vector<scalar_t> results(num_tasks);
+    scalar_t* results_data = results.data();
+    parallel_for(


I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: * before this PR, single-threaded (caffe2_threadpool_force_inline=true); * before this PR, multi-threaded; * after this PR, single-threaded; * after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: * before this PR, single-threaded (caffe2_threadpool_force_inline=true); * before this PR, multi-threaded; * after this PR, single-threaded; * after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. ghstack-source-id: 957d6b4 Pull Request resolved: #26702

ilia-cher · 2019-09-25T06:07:59Z

aten/src/ATen/ParallelCommon.cpp

@@ -89,12 +89,19 @@ std::string get_parallel_info() {
 }



nit: could you add some info into get_parallel_info, e.g. whether we're using a mobile thread pool

sure, will append "[mobile]" to aten parallel backend string for mobile build.

ilia-cher · 2019-09-25T06:17:14Z

aten/src/ATen/ParallelNative.cpp

+  if (pool) {
+    pool->run(fn, range);
+  } else {
+    for (size_t i = 0; i < range; ++i) {


we might want to add some warning logging here

(although, I'm less sure about details of mobile logging - up to you)

ilia-cher · 2019-09-25T06:23:52Z

aten/src/ATen/ParallelNative.cpp

 }

 int get_thread_num() {
  return thread_num_;
 }

 bool in_parallel_region() {
+#ifndef C10_MOBILE
  return in_parallel_region_ || (
    num_intraop_threads.load() == CONSUMED &&
    internal::_get_intraop_pool().inThreadPool()
  );


we could just unify this as return in_parallel_region_

I didn't change it because intraop_launch() / intraop_launch_future() don't set in_parallel_region_ - technically those callback functions could call parallel APIs too. We don't implement intraop_launch() for mobile (yet) so it's not an issue for mobile.

ilia-cher · 2019-09-25T06:24:44Z

aten/src/ATen/ParallelNative.cpp

+#else
+  // TODO: caffe2::ThreadPool doesn't support submitting tasks separately and
+  // running in parallel. Should fix it when this API becomes popular.
+  func();


maybe also throw?

also same for intraop_launch_future

The reason I don't throw for these APIs is that they are actually called from some unit tests - and we don't necessarily want to block other engs from using these APIs simplify because mobile doesn't support it yet. We could work on it separately as fixing "perf issues" instead of crashes...

ilia-cher · 2019-09-25T06:33:38Z

aten/src/ATen/ParallelNative.h

+    futures[i] = std::make_shared<c10::ivalue::Future>(NoneType::get());
+  }
+  auto task = [f, &eptr, &err_flag, &futures, begin, end, chunk_size]
+      (int idx, size_t task_id) {


nit: we don't seem to be using idx?

We have idx because caffe2 thread pool API expects this signature. We could create another lambda wrapper to hide it but I feel a bit overkill... I can rename it to "_" if you prefer.

ilia-cher

LG, please also test native backend version (on server) with
ATEN_THREADING=NATIVE BLAS=MKL USE_MKLDNN=1 python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
and also test_jit

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: * before this PR, single-threaded (caffe2_threadpool_force_inline=true); * before this PR, multi-threaded; * after this PR, single-threaded; * after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: * before this PR, single-threaded (caffe2_threadpool_force_inline=true); * before this PR, multi-threaded; * after this PR, single-threaded; * after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. ghstack-source-id: 45813de Pull Request resolved: #26702

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: ** before this PR, single-threaded (caffe2_threadpool_force_inline=true); ** before this PR, multi-threaded; ** after this PR, single-threaded; ** after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: ** at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); ** at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); ** 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; ** depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: ** before this PR, single-threaded (caffe2_threadpool_force_inline=true); ** before this PR, multi-threaded; ** after this PR, single-threaded; ** after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: ** at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); ** at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); ** 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; ** depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` ghstack-source-id: bef4a92 Pull Request resolved: #26702

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: ** before this PR, single-threaded (caffe2_threadpool_force_inline=true); ** before this PR, multi-threaded; ** after this PR, single-threaded; ** after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: ** at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); ** at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); ** 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; ** depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412) [ghstack-poisoned]

ljk53 · 2019-09-26T06:42:09Z

Landed on master

…ementation to cpp Summary: template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` [ghstack-poisoned]

…ementation to cpp Summary: template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` ghstack-source-id: 01dd2d2 Pull Request resolved: #26969

…common implementation to cpp" Summary: template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: [D17628089](https://our.internmc.facebook.com/intern/diff/D17628089) [ghstack-poisoned]

…ementation to cpp Summary: template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` ghstack-source-id: 64b91e0 Pull Request resolved: #26969

Summary: Pull Request resolved: #26969 template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: D17628089 Pulled By: ljk53 fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66

…ch#26969) Summary: Pull Request resolved: pytorch#26969 template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as pytorch#26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: D17628089 Pulled By: ljk53 fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66

Summary: Pull Request resolved: #26969 template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as #26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: D17628089 Pulled By: ljk53 fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66

…ch#26969) Summary: Pull Request resolved: pytorch#26969 template got inflated into many places. This PR extracted out common implementation that doesn't depend on template param. After: Compressed ARMv7 AAR size: 5,677,469->5,398,011 RAW libpytorch.so size: 16,862,108->16,047,004 Test Plan: - Test perf/correctness as pytorch#26702; - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` Differential Revision: D17628089 Pulled By: ljk53 fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66

Summary: This diff implemented at::parallel_for()/parallel_reduce() and other ATen/Parallel.h APIs for mobile using caffe2::ThreadPool. caffe2::ThreadPool doesn't support submitting individual tasks separately and running them in parallel - all tasks need to be submit in one batch which will lock the thread pool until all of them finish - as a result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface and reuse at::parallel_for() implementation in ParallelNative.h. Because of this constraint, intraop_launch() / intraop_launch_future() are not supported yet. This diff doesn't touch inter-ops pool - it's still default native c10 thread pool. Will work on it when it's widely used. Test Plan: - Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups. - Measured time taken by each convolution layer in MobileNetV2 with the follow combinations: ** before this PR, single-threaded (caffe2_threadpool_force_inline=true); ** before this PR, multi-threaded; ** after this PR, single-threaded; ** after this PR, multi-threaded; ``` +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Input | Kernel | Groups | Stride | Wino- | Before | Before | After | After | | | | | | grad? | Single | Multi | Single | Multi | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | [1, 3, 224, 224] | [32, 3, 3, 3] | 1 | 2 | | 7270 | 10816 | 7471 | 11071 | | [1, 32, 112, 112] | [32, 1, 3, 3] | 32 | 1 | Yes | 7563 | 11300 | 8074 | 3153 | | [1, 32, 112, 112] | [16, 32, 1, 1] | 1 | 1 | | 4347 | 2274 | 4348 | 4441 | | [1, 16, 112, 112] | [96, 16, 1, 1] | 1 | 1 | | 14352 | 10947 | 14221 | 12739 | | [1, 96, 112, 112] | [96, 1, 3, 3] | 96 | 2 | | 20202 | 30038 | 22218 | 33296 | | [1, 96, 56, 56] | [24, 96, 1, 1] | 1 | 1 | | 4620 | 5485 | 4595 | 7585 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7526 | 7909 | 7538 | 3772 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 1 | Yes | 8870 | 13318 | 9295 | 7035 | | [1, 144, 56, 56] | [24, 144, 1, 1] | 1 | 1 | | 6823 | 6430 | 6795 | 7067 | | [1, 24, 56, 56] | [144, 24, 1, 1] | 1 | 1 | | 7561 | 12839 | 7528 | 9646 | | [1, 144, 56, 56] | [144, 1, 3, 3] | 144 | 2 | | 8879 | 13371 | 9774 | 14753 | | [1, 144, 28, 28] | [32, 144, 1, 1] | 1 | 1 | | 2297 | 1408 | 2282 | 1926 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3287 | 7740 | 3256 | 1873 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3245 | 4892 | 3294 | 1440 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3032 | 4494 | 3010 | 6376 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 3227 | 3277 | 1780 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 1 | Yes | 3237 | 4912 | 3284 | 1500 | | [1, 192, 28, 28] | [32, 192, 1, 1] | 1 | 1 | | 3060 | 3538 | 3008 | 5658 | | [1, 32, 28, 28] | [192, 32, 1, 1] | 1 | 1 | | 3273 | 6590 | 3281 | 1466 | | [1, 192, 28, 28] | [192, 1, 3, 3] | 192 | 2 | | 6202 | 9395 | 6580 | 10155 | | [1, 192, 14, 14] | [64, 192, 1, 1] | 1 | 1 | | 1568 | 2030 | 1562 | 1560 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3188 | 5217 | 3168 | 1501 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1889 | 2867 | 1997 | 776 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3053 | 1441 | 3053 | 1547 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3182 | 1188 | 3183 | 1512 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1907 | 2860 | 1992 | 794 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3046 | 1168 | 3037 | 1474 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3225 | 1190 | 3186 | 1599 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1899 | 2856 | 2014 | 773 | | [1, 384, 14, 14] | [64, 384, 1, 1] | 1 | 1 | | 3042 | 1169 | 3050 | 1494 | | [1, 64, 14, 14] | [384, 64, 1, 1] | 1 | 1 | | 3209 | 1163 | 3191 | 1507 | | [1, 384, 14, 14] | [384, 1, 3, 3] | 384 | 1 | Yes | 1879 | 2832 | 1994 | 774 | | [1, 384, 14, 14] | [96, 384, 1, 1] | 1 | 1 | | 4596 | 2023 | 4606 | 2310 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7081 | 2429 | 7012 | 3086 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2828 | 4269 | 3010 | 1145 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 6989 | 2897 | 6973 | 3491 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7084 | 2446 | 7078 | 3106 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 1 | Yes | 2853 | 4281 | 3006 | 1405 | | [1, 576, 14, 14] | [96, 576, 1, 1] | 1 | 1 | | 7003 | 7734 | 6995 | 3575 | | [1, 96, 14, 14] | [576, 96, 1, 1] | 1 | 1 | | 7082 | 2514 | 7059 | 3032 | | [1, 576, 14, 14] | [576, 1, 3, 3] | 576 | 2 | | 14518 | 21718 | 14985 | 22810 | | [1, 576, 7, 7] | [160, 576, 1, 1] | 1 | 1 | | 3043 | 7370 | 3034 | 1951 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5164 | 5660 | 5151 | 2252 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2722 | 1900 | 735 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4935 | 1943 | 4943 | 2134 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5140 | 1817 | 5192 | 2325 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1786 | 2732 | 1871 | 725 | | [1, 960, 7, 7] | [160, 960, 1, 1] | 1 | 1 | | 4919 | 1881 | 4916 | 2202 | | [1, 160, 7, 7] | [960, 160, 1, 1] | 1 | 1 | | 5201 | 1816 | 5186 | 2322 | | [1, 960, 7, 7] | [960, 1, 3, 3] | 960 | 1 | Yes | 1808 | 2718 | 1871 | 726 | | [1, 960, 7, 7] | [320, 960, 1, 1] | 1 | 1 | | 9883 | 3567 | 9860 | 4303 | | [1, 320, 7, 7] | [1280, 320, 1, 1] | 1 | 1 | | 13407 | 4789 | 13384 | 5857 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ | Total | 277112 | 284230 | 282588 | 231535 | +-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+ ``` - The data confirmed that: ** at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel); ** at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after); ** 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases; ** depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower; - Verified the inference output was the same across all combinations. - Run tests for non-mobile native aten_threading: ``` ATEN_THREADING=NATIVE python setup.py develop --cmake pytest -s -v test/test_torch.py::TestTorch pytest -s -v test/test_jit.py ``` ghstack-source-id: 3dbc395 Pull Request resolved: pytorch/pytorch#26702

pytorchbot added caffe2 module: build Build system issues module: internals Related to internal abstractions in c10 and ATen module: openmp Related to OpenMP (omp) support in PyTorch labels Sep 24, 2019

This was referenced Sep 24, 2019

[pytorch][mobile] remove backward functions from jit-op-registry for mobile build #26657

Merged

[pytorch] rename caffe2::mobile_threadpool to caffe2::mobile_pthreadpool #26700

Closed

ljk53 requested review from ilia-cher, dzhulgakov, supriyar, AshkanAliabadi and dreiss September 24, 2019 02:08

ljk53 commented Sep 24, 2019

View reviewed changes

ljk53 changed the title ~~[pytorch][[perf] add mobile friendly at:parallel_for backend~~ [pytorch][perf] add mobile friendly at:parallel_for backend Sep 24, 2019

AshkanAliabadi reviewed Sep 24, 2019

View reviewed changes

aten/src/ATen/ParallelNativeMobile.h Outdated Show resolved Hide resolved

ljk53 added this to the 1.3 milestone Sep 24, 2019

ljk53 mentioned this pull request Sep 24, 2019

[pytorch] fix ParallelNative.h/cpp build #26746

Closed

dzhulgakov reviewed Sep 25, 2019

View reviewed changes

ilia-cher reviewed Sep 25, 2019

View reviewed changes

ilia-cher approved these changes Sep 25, 2019

View reviewed changes

ljk53 added 2 commits September 25, 2019 14:29

ljk53 closed this Sep 26, 2019

ljk53 mentioned this pull request Sep 27, 2019

[pytorch][perf] Disable threading on convolutions going through NNPACK. #26940

Closed

ljk53 mentioned this pull request Sep 27, 2019

[pytorch][mobile][size] move parallel_for/parallel_reduce common implementation to cpp #26969

Closed

ljk53 mentioned this pull request Sep 28, 2019

[v1.3.0] move parallel_for/parallel_reduce common implementation to cpp (#26969) #27014

Closed

facebook-github-bot deleted the gh/ljk53/56/head branch October 28, 2019 22:16

[pytorch][perf] add mobile friendly at:parallel_for backend #26702

[pytorch][perf] add mobile friendly at:parallel_for backend #26702

Uh oh!

Conversation

ljk53 commented Sep 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ljk53 Sep 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AshkanAliabadi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dzhulgakov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ljk53 Sep 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilia-cher Sep 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilia-cher Sep 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilia-cher left a comment

Choose a reason for hiding this comment

Uh oh!

ljk53 commented Sep 26, 2019

Uh oh!

Uh oh!

ljk53 commented Sep 24, 2019 •

edited

Loading

ljk53 Sep 25, 2019 •

edited

Loading

ljk53 Sep 25, 2019 •

edited

Loading

ilia-cher Sep 25, 2019 •

edited

Loading

ilia-cher Sep 25, 2019 •

edited

Loading