[pytorch][perf] add mobile friendly at:parallel_for backend #26702


Closed
wants to merge 14 commits

Conversation

ljk53
Contributor

@ljk53 ljk53 commented Sep 24, 2019

Stack from ghstack:

Summary:
This diff implements at::parallel_for()/parallel_reduce() and the other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks separately and
running them in parallel - all tasks must be submitted in one batch, which
locks the thread pool until all of them finish. As a result, we didn't wrap
caffe2::ThreadPool with the TaskThreadPoolBase interface and reuse the
at::parallel_for() implementation in ParallelNative.h. Because of this
constraint, intraop_launch() / intraop_launch_future() are not supported yet.

This diff doesn't touch the inter-op pool - it's still the default native c10
thread pool. Will work on it when it's more widely used.
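
The batch-submission constraint can be sketched as follows. This is an illustrative model only, with made-up names (BatchPool, its run() primitive) - it is not the actual caffe2::ThreadPool API. The pool's only primitive runs a whole batch of tasks and blocks until all of them finish, and parallel_for-style chunking is layered on top of it:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical batch-only pool (illustration, not caffe2::ThreadPool):
// its sole primitive submits num_tasks tasks at once and blocks until
// every task has finished, i.e. the pool is "locked" for the whole batch.
struct BatchPool {
  void run(const std::function<void(int64_t)>& task, int64_t num_tasks) {
    std::vector<std::thread> workers;
    for (int64_t i = 0; i < num_tasks; ++i) {
      workers.emplace_back(task, i);
    }
    for (auto& w : workers) {
      w.join();  // batch semantics: nothing returns until all tasks finish
    }
  }
};

// parallel_for layered on top of the batch primitive: split [begin, end)
// into roughly grain_size-sized chunks and submit them as one batch.
template <typename F>
void parallel_for(int64_t begin, int64_t end, int64_t grain_size, const F& f) {
  const int64_t range = end - begin;
  if (range <= 0) return;
  const int64_t num_tasks = std::max<int64_t>(1, range / grain_size);
  const int64_t chunk = (range + num_tasks - 1) / num_tasks;
  BatchPool pool;
  pool.run(
      [&](int64_t task_id) {
        const int64_t b = begin + task_id * chunk;
        const int64_t e = std::min(b + chunk, end);
        if (b < e) f(b, e);
      },
      num_tasks);
}
```

A real pool would of course reuse a fixed set of worker threads instead of spawning one per task; the point here is only the batch-at-a-time interface.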

Test Plan:

  • Patch PR #26556 ([pytorch][perf] Use Caffe2's implementation of grouped depthwise 3x3 convolutions) and use at::parallel_for() to run the depthwise-3x3-winograd kernel across groups.

  • Measured the time taken by each convolution layer in MobileNetV2 with the following combinations:
    * before this PR, single-threaded (caffe2_threadpool_force_inline=true);
    * before this PR, multi-threaded;
    * after this PR, single-threaded;
    * after this PR, multi-threaded.

+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|       Input       |      Kernel       | Groups | Stride | Wino- | Before | Before | After  | After  |
|                   |                   |        |        | grad? | Single | Multi  | Single | Multi  |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
| [1, 3, 224, 224]  | [32, 3, 3, 3]     |      1 |      2 |       |   7270 |  10816 |   7471 |  11071 |
| [1, 32, 112, 112] | [32, 1, 3, 3]     |     32 |      1 | Yes   |   7563 |  11300 |   8074 |   3153 |
| [1, 32, 112, 112] | [16, 32, 1, 1]    |      1 |      1 |       |   4347 |   2274 |   4348 |   4441 |
| [1, 16, 112, 112] | [96, 16, 1, 1]    |      1 |      1 |       |  14352 |  10947 |  14221 |  12739 |
| [1, 96, 112, 112] | [96, 1, 3, 3]     |     96 |      2 |       |  20202 |  30038 |  22218 |  33296 |
| [1, 96, 56, 56]   | [24, 96, 1, 1]    |      1 |      1 |       |   4620 |   5485 |   4595 |   7585 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7526 |   7909 |   7538 |   3772 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      1 | Yes   |   8870 |  13318 |   9295 |   7035 |
| [1, 144, 56, 56]  | [24, 144, 1, 1]   |      1 |      1 |       |   6823 |   6430 |   6795 |   7067 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7561 |  12839 |   7528 |   9646 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      2 |       |   8879 |  13371 |   9774 |  14753 |
| [1, 144, 28, 28]  | [32, 144, 1, 1]   |      1 |      1 |       |   2297 |   1408 |   2282 |   1926 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3287 |   7740 |   3256 |   1873 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3245 |   4892 |   3294 |   1440 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3032 |   4494 |   3010 |   6376 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   3227 |   3277 |   1780 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3237 |   4912 |   3284 |   1500 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3060 |   3538 |   3008 |   5658 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   6590 |   3281 |   1466 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      2 |       |   6202 |   9395 |   6580 |  10155 |
| [1, 192, 14, 14]  | [64, 192, 1, 1]   |      1 |      1 |       |   1568 |   2030 |   1562 |   1560 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3188 |   5217 |   3168 |   1501 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1889 |   2867 |   1997 |    776 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3053 |   1441 |   3053 |   1547 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3182 |   1188 |   3183 |   1512 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1907 |   2860 |   1992 |    794 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3046 |   1168 |   3037 |   1474 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3225 |   1190 |   3186 |   1599 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1899 |   2856 |   2014 |    773 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3042 |   1169 |   3050 |   1494 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3209 |   1163 |   3191 |   1507 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1879 |   2832 |   1994 |    774 |
| [1, 384, 14, 14]  | [96, 384, 1, 1]   |      1 |      1 |       |   4596 |   2023 |   4606 |   2310 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7081 |   2429 |   7012 |   3086 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2828 |   4269 |   3010 |   1145 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   6989 |   2897 |   6973 |   3491 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
  • The data confirmed that:
    * at::parallel_for() accelerated inference in multi-threaded mode (it was only integrated into the winograd kernel);
    * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before vs. after);
    * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases;
    * depthwise-strided kernels didn't use the winograd kernel - multi-threaded mode was still slower.

  • Verified the inference output was the same across all combinations.

  • Ran tests with the non-mobile native ATen threading backend:

```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```

Differential Revision: D17543412

@pytorchbot pytorchbot added caffe2 module: build Build system issues module: internals Related to internal abstractions in c10 and ATen module: openmp Related to OpenMP (omp) support in PyTorch labels Sep 24, 2019
ljk53 added a commit that referenced this pull request Sep 24, 2019
```
internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
std::vector<scalar_t> results(num_tasks);
scalar_t* results_data = results.data();
parallel_for(
```
Contributor Author

Here I diverged from ParallelNative.h - I used parallel_for() + get_thread_num() to implement it in order to reduce the amount of complicated logic in this PR. Let me know if you think I should take the original approach.
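
The parallel_for() + get_thread_num() approach can be sketched like this - a simplified, self-contained illustration with hypothetical names (parallel_for_with_id, the thread-spawning stand-in for the pool), not the PR's actual code. Each task reduces its own chunk into a slot of a results vector indexed by task id, and the partial results are combined sequentially at the end, so no cross-thread synchronization is needed on the result:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal stand-in for at::parallel_for (illustration only): run each
// chunk of [begin, end) on its own thread and pass the task id to f.
template <typename F>
void parallel_for_with_id(int64_t begin, int64_t end, int64_t grain_size,
                          const F& f) {
  const int64_t range = end - begin;
  if (range <= 0) return;
  const int64_t num_tasks = std::max<int64_t>(1, range / grain_size);
  const int64_t chunk = (range + num_tasks - 1) / num_tasks;
  std::vector<std::thread> workers;
  for (int64_t t = 0; t < num_tasks; ++t) {
    const int64_t b = begin + t * chunk;
    const int64_t e = std::min(b + chunk, end);
    if (b < e) workers.emplace_back([=] { f(t, b, e); });
  }
  for (auto& w : workers) w.join();
}

// parallel_reduce in the style described above: each task writes its
// partial reduction into results[task_id]; the partials are then
// combined sequentially with the combine function sf.
template <typename scalar_t, typename F, typename SF>
scalar_t parallel_reduce(int64_t begin, int64_t end, int64_t grain_size,
                         scalar_t ident, const F& f, const SF& sf) {
  const int64_t range = end - begin;
  const int64_t num_tasks =
      range > 0 ? std::max<int64_t>(1, range / grain_size) : 1;
  std::vector<scalar_t> results(num_tasks, ident);
  scalar_t* results_data = results.data();
  parallel_for_with_id(begin, end, grain_size,
                       [&](int64_t task_id, int64_t b, int64_t e) {
                         results_data[task_id] = f(b, e, ident);
                       });
  scalar_t result = ident;
  for (const scalar_t partial : results) result = sf(result, partial);
  return result;
}
```

The nesting cost mentioned below is visible here: the reduce body is a lambda handed to another lambda inside parallel_for_with_id.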

Collaborator

I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?

Contributor Author

@ljk53 ljk53 Sep 25, 2019

> I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?

We could probably merge ParallelNative.h and ParallelNativeMobile.h - just implement the internal::_run_with_pool API with c10::ThreadPool there. The new implementation doesn't expose internal APIs like internal::_get_intraop_pool / internal::_set_thread_num / etc. in the header file.

We could probably keep separate cpp files, though, as each function is slightly different.

@ilia-cher please let me know your thoughts.

Contributor

feel free to refactor internal namespace functions (y)

ljk53 added a commit that referenced this pull request Sep 24, 2019
@ljk53 ljk53 changed the title [pytorch][[perf] add mobile friendly at:parallel_for backend [pytorch][perf] add mobile friendly at:parallel_for backend Sep 24, 2019
Contributor

@AshkanAliabadi AshkanAliabadi left a comment

This is great! Awesome work.

@ljk53 ljk53 added this to the 1.3 milestone Sep 24, 2019
ljk53 added a commit that referenced this pull request Sep 24, 2019
Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks need to be submit in
one batch which will lock the thread pool until all of them finish - as a
result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface
and reuse at::parallel_for() implementation in ParallelNative.h. Because
of this constraint, intraop_launch() / intraop_launch_future() are not
supported yet.

This diff doesn't touch inter-ops pool - it's still default native c10
thread pool. Will work on it when it's widely used.

Test Plan:
- This is early draft to receive feedback. Will do more thorough tests.

ghstack-source-id: 6615f7c
Pull Request resolved: #26702
Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks need to be submit in
one batch which will lock the thread pool until all of them finish - as a
result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface
and reuse at::parallel_for() implementation in ParallelNative.h. Because
of this constraint, intraop_launch() / intraop_launch_future() are not
supported yet.

This diff doesn't touch inter-ops pool - it's still default native c10
thread pool. Will work on it when it's widely used.

Test Plan:
- Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups.

- Measured time taken by each convolution layer in MobileNetV2 with the follow combinations:
 * before this PR, single-threaded (caffe2_threadpool_force_inline=true);
 * before this PR, multi-threaded;
 * after this PR, single-threaded;
 * after this PR, multi-threaded;

```
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|       Input       |      Kernel       | Groups | Stride | Wino- | Before | Before | After  | After  |
|                   |                   |        |        | grad? | Single | Multi  | Single | Multi  |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
| [1, 3, 224, 224]  | [32, 3, 3, 3]     |      1 |      2 |       |   7270 |  10816 |   7471 |  11071 |
| [1, 32, 112, 112] | [32, 1, 3, 3]     |     32 |      1 | Yes   |   7563 |  11300 |   8074 |   3153 |
| [1, 32, 112, 112] | [16, 32, 1, 1]    |      1 |      1 |       |   4347 |   2274 |   4348 |   4441 |
| [1, 16, 112, 112] | [96, 16, 1, 1]    |      1 |      1 |       |  14352 |  10947 |  14221 |  12739 |
| [1, 96, 112, 112] | [96, 1, 3, 3]     |     96 |      2 |       |  20202 |  30038 |  22218 |  33296 |
| [1, 96, 56, 56]   | [24, 96, 1, 1]    |      1 |      1 |       |   4620 |   5485 |   4595 |   7585 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7526 |   7909 |   7538 |   3772 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      1 | Yes   |   8870 |  13318 |   9295 |   7035 |
| [1, 144, 56, 56]  | [24, 144, 1, 1]   |      1 |      1 |       |   6823 |   6430 |   6795 |   7067 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7561 |  12839 |   7528 |   9646 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      2 |       |   8879 |  13371 |   9774 |  14753 |
| [1, 144, 28, 28]  | [32, 144, 1, 1]   |      1 |      1 |       |   2297 |   1408 |   2282 |   1926 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3287 |   7740 |   3256 |   1873 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3245 |   4892 |   3294 |   1440 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3032 |   4494 |   3010 |   6376 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   3227 |   3277 |   1780 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3237 |   4912 |   3284 |   1500 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3060 |   3538 |   3008 |   5658 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   6590 |   3281 |   1466 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      2 |       |   6202 |   9395 |   6580 |  10155 |
| [1, 192, 14, 14]  | [64, 192, 1, 1]   |      1 |      1 |       |   1568 |   2030 |   1562 |   1560 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3188 |   5217 |   3168 |   1501 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1889 |   2867 |   1997 |    776 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3053 |   1441 |   3053 |   1547 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3182 |   1188 |   3183 |   1512 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1907 |   2860 |   1992 |    794 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3046 |   1168 |   3037 |   1474 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3225 |   1190 |   3186 |   1599 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1899 |   2856 |   2014 |    773 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3042 |   1169 |   3050 |   1494 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3209 |   1163 |   3191 |   1507 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1879 |   2832 |   1994 |    774 |
| [1, 384, 14, 14]  | [96, 384, 1, 1]   |      1 |      1 |       |   4596 |   2023 |   4606 |   2310 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7081 |   2429 |   7012 |   3086 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2828 |   4269 |   3010 |   1145 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   6989 |   2897 |   6973 |   3491 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
```

- The data confirmed that:
 * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel);
 * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after);
 * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases;
 * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower;

- Verified the inference output was the same across all combinations.

Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412)

[ghstack-poisoned]
Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks need to be submit in
one batch which will lock the thread pool until all of them finish - as a
result we didn't wrap caffe2::ThreadPool with TaskThreadPoolBase interface
and reuse at::parallel_for() implementation in ParallelNative.h. Because
of this constraint, intraop_launch() / intraop_launch_future() are not
supported yet.

This diff doesn't touch inter-ops pool - it's still default native c10
thread pool. Will work on it when it's widely used.

Test Plan:
- Patch PR #26556 and use at::parallel_for() to run depthwise-3x3-winograd kernel across groups.

- Measured time taken by each convolution layer in MobileNetV2 with the follow combinations:
 * before this PR, single-threaded (caffe2_threadpool_force_inline=true);
 * before this PR, multi-threaded;
 * after this PR, single-threaded;
 * after this PR, multi-threaded;

```
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|       Input       |      Kernel       | Groups | Stride | Wino- | Before | Before | After  | After  |
|                   |                   |        |        | grad? | Single | Multi  | Single | Multi  |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
| [1, 3, 224, 224]  | [32, 3, 3, 3]     |      1 |      2 |       |   7270 |  10816 |   7471 |  11071 |
| [1, 32, 112, 112] | [32, 1, 3, 3]     |     32 |      1 | Yes   |   7563 |  11300 |   8074 |   3153 |
| [1, 32, 112, 112] | [16, 32, 1, 1]    |      1 |      1 |       |   4347 |   2274 |   4348 |   4441 |
| [1, 16, 112, 112] | [96, 16, 1, 1]    |      1 |      1 |       |  14352 |  10947 |  14221 |  12739 |
| [1, 96, 112, 112] | [96, 1, 3, 3]     |     96 |      2 |       |  20202 |  30038 |  22218 |  33296 |
| [1, 96, 56, 56]   | [24, 96, 1, 1]    |      1 |      1 |       |   4620 |   5485 |   4595 |   7585 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7526 |   7909 |   7538 |   3772 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      1 | Yes   |   8870 |  13318 |   9295 |   7035 |
| [1, 144, 56, 56]  | [24, 144, 1, 1]   |      1 |      1 |       |   6823 |   6430 |   6795 |   7067 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7561 |  12839 |   7528 |   9646 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      2 |       |   8879 |  13371 |   9774 |  14753 |
| [1, 144, 28, 28]  | [32, 144, 1, 1]   |      1 |      1 |       |   2297 |   1408 |   2282 |   1926 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3287 |   7740 |   3256 |   1873 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3245 |   4892 |   3294 |   1440 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3032 |   4494 |   3010 |   6376 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   3227 |   3277 |   1780 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3237 |   4912 |   3284 |   1500 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3060 |   3538 |   3008 |   5658 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   6590 |   3281 |   1466 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      2 |       |   6202 |   9395 |   6580 |  10155 |
| [1, 192, 14, 14]  | [64, 192, 1, 1]   |      1 |      1 |       |   1568 |   2030 |   1562 |   1560 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3188 |   5217 |   3168 |   1501 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1889 |   2867 |   1997 |    776 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3053 |   1441 |   3053 |   1547 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3182 |   1188 |   3183 |   1512 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1907 |   2860 |   1992 |    794 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3046 |   1168 |   3037 |   1474 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3225 |   1190 |   3186 |   1599 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1899 |   2856 |   2014 |    773 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3042 |   1169 |   3050 |   1494 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3209 |   1163 |   3191 |   1507 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1879 |   2832 |   1994 |    774 |
| [1, 384, 14, 14]  | [96, 384, 1, 1]   |      1 |      1 |       |   4596 |   2023 |   4606 |   2310 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7081 |   2429 |   7012 |   3086 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2828 |   4269 |   3010 |   1145 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   6989 |   2897 |   6973 |   3491 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
```

- The data confirmed that:
 * at::parallel_for() accelerated inference in multi-threaded mode (only integrated it for winograd kernel);
 * at::parallel_for() didn't add much overhead in single-threaded mode (winograd, before v.s. after);
 * 1x1 kernels were using NNPACK - multi-threaded mode was faster in most cases;
 * depthwise-strided kernels didn't use winograd kernel - multi-threaded mode was still slower;

- Verified the inference output was the same across all combinations.

Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412)

[ghstack-poisoned]
@dzhulgakov (Collaborator) left a comment:
Looks good to me, I'll let Ilia double-check.

}
futures[task_id].markCompleted();
};
internal::_run_with_pool(task, num_tasks);
A Collaborator commented:
we don't utilize the main thread to do some compute, do we?

@ljk53 (Contributor, Author) replied on Sep 25, 2019:
caffe2's thread pool (gemmlowp clone) runs one task (first one) on current thread directly: https://github.com/pytorch/pytorch/blob/master/caffe2/utils/threadpool/WorkersPool.h#L345

That's one of the reasons I decided to fork and modify ParallelNative.h (which also runs on main thread explicitly) - doing it at both places is wrong.
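
The hazard described here can be illustrated with a small sketch. A gemmlowp-style pool runs task 0 inline on the calling thread and only fans the remaining tasks out to workers, so a wrapper that also ran a chunk on the main thread would execute task 0 twice. This is why the mobile backend forks ParallelNative.h rather than layering on top of the pool. (`run_batch` is a stand-in name using std::thread, not the real caffe2::ThreadPool API.)

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Sketch of the gemmlowp-style dispatch: task 0 runs inline on the
// calling thread, tasks 1..n-1 run on worker threads, and run_batch
// blocks until all of them finish.
void run_batch(const std::function<void(int /*task_id*/)>& task,
               int num_tasks) {
  std::vector<std::thread> workers;
  for (int i = 1; i < num_tasks; ++i) {
    workers.emplace_back([&task, i] { task(i); });
  }
  task(0);  // first task on the current thread, as in WorkersPool.h
  for (auto& w : workers) w.join();
}
```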

internal::calc_num_tasks_and_chunk_size(begin, end, grain_size);
std::vector<scalar_t> results(num_tasks);
scalar_t* results_data = results.data();
parallel_for(
A Collaborator commented:
I guess the only downside is nested lambda, but it shouldn't matter in practice. If it's ok with Ilia - maybe do the same to ParallelNative.h?
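
The parallel_reduce shape under discussion, where each task reduces its chunk into a private slot of `results` and the partials are combined serially afterwards, can be sketched as below. No synchronization is needed on the accumulator because each nested lambda writes only its own slot. The chunking here is simplified relative to internal::calc_num_tasks_and_chunk_size, and std::thread stands in for the pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sketch of parallel_reduce via per-task partial results: task t sums
// its chunk into results[t]; the partials are folded serially at the end.
int64_t parallel_sum_sketch(int64_t begin, int64_t end, int64_t num_tasks) {
  std::vector<int64_t> results(num_tasks, 0);
  const int64_t chunk = (end - begin + num_tasks - 1) / num_tasks;
  std::vector<std::thread> workers;
  for (int64_t t = 0; t < num_tasks; ++t) {
    const int64_t b = begin + t * chunk;
    const int64_t e = std::min(end, b + chunk);
    if (b >= e) continue;
    workers.emplace_back([&results, t, b, e] {
      int64_t acc = 0;
      for (int64_t i = b; i < e; ++i) acc += i;
      results[t] = acc;  // each task touches only its own slot
    });
  }
  for (auto& w : workers) w.join();
  return std::accumulate(results.begin(), results.end(), int64_t{0});
}
```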

ljk53 added a commit that referenced this pull request Sep 25, 2019
ghstack-source-id: 957d6b4
Pull Request resolved: #26702
@@ -89,12 +89,19 @@ std::string get_parallel_info() {
}

A Contributor commented:
nit: could you add some info into get_parallel_info, e.g. whether we're using a mobile thread pool

The author (Contributor) replied:
Sure, I will append "[mobile]" to the ATen parallel backend string for mobile builds.

if (pool) {
pool->run(fn, range);
} else {
for (size_t i = 0; i < range; ++i) {
A Contributor commented:
we might want to add some warning logging here

The same Contributor added:
(although, I'm less sure about details of mobile logging - up to you)
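
The fallback path being discussed, running the range serially when no pool is available, could look roughly like the sketch below with the suggested warning added. `run_serial_fallback` is an illustrative stand-in, and fprintf stands in for whatever logging macro (e.g. a TORCH_WARN-style call) the mobile build would actually use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>

// Sketch of the no-pool fallback: warn once, then run every task
// serially on the calling thread.
void run_serial_fallback(const std::function<void(std::size_t)>& fn,
                         std::size_t range) {
  std::fprintf(stderr,
               "warning: thread pool unavailable, running %zu task(s) serially\n",
               range);
  for (std::size_t i = 0; i < range; ++i) {
    fn(i);
  }
}
```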

}

int get_thread_num() {
return thread_num_;
}

bool in_parallel_region() {
#ifndef C10_MOBILE
return in_parallel_region_ || (
num_intraop_threads.load() == CONSUMED &&
internal::_get_intraop_pool().inThreadPool()
);
A Contributor commented:
we could just unify this as return in_parallel_region_

The author (Contributor) replied:
I didn't change it because intraop_launch() / intraop_launch_future() don't set in_parallel_region_ - technically those callback functions could call parallel APIs too. We don't implement intraop_launch() for mobile (yet) so it's not an issue for mobile.

#else
// TODO: caffe2::ThreadPool doesn't support submitting tasks separately and
// running in parallel. Should fix it when this API becomes popular.
func();
@ilia-cher (Contributor) commented on Sep 25, 2019:
maybe also throw?

The same Contributor added:
also same for intraop_launch_future

The author (Contributor) replied:
The reason I don't throw for these APIs is that they are actually called from some unit tests - and we don't necessarily want to block other engineers from using these APIs simply because mobile doesn't support them yet. We could work on it separately as fixing "perf issues" instead of crashes...

futures[i] = std::make_shared<c10::ivalue::Future>(NoneType::get());
}
auto task = [f, &eptr, &err_flag, &futures, begin, end, chunk_size]
(int idx, size_t task_id) {
@ilia-cher (Contributor) commented on Sep 25, 2019:
nit: we don't seem to be using idx?

The author (Contributor) replied:
We have idx because caffe2 thread pool API expects this signature. We could create another lambda wrapper to hide it but I feel a bit overkill... I can rename it to "_" if you prefer.

@ilia-cher (Contributor) left a comment:

LG, please also test native backend version (on server) with
ATEN_THREADING=NATIVE BLAS=MKL USE_MKLDNN=1 python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
and also test_jit

| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
```

- The data confirmed that:
 * at::parallel_for() accelerated inference in multi-threaded mode (it was only integrated into the Winograd kernel);
 * at::parallel_for() added little overhead in single-threaded mode (compare the Winograd rows before vs. after);
 * 1x1 kernels used NNPACK, so multi-threaded mode was faster in most cases;
 * strided depthwise kernels didn't use the Winograd kernel, so multi-threaded mode was still slower;

- Verified the inference output was the same across all combinations.

Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412)

[ghstack-poisoned]
ljk53 added a commit that referenced this pull request Sep 25, 2019
ghstack-source-id: 45813de
Pull Request resolved: #26702
- Run tests for non-mobile native aten_threading:
```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```
ljk53 added a commit that referenced this pull request Sep 25, 2019
ghstack-source-id: bef4a92
Pull Request resolved: #26702
```
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|       Input       |      Kernel       | Groups | Stride | Wino- | Before | Before | After  | After  |
|                   |                   |        |        | grad? | Single | Multi  | Single | Multi  |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
| [1, 3, 224, 224]  | [32, 3, 3, 3]     |      1 |      2 |       |   7270 |  10816 |   7471 |  11071 |
| [1, 32, 112, 112] | [32, 1, 3, 3]     |     32 |      1 | Yes   |   7563 |  11300 |   8074 |   3153 |
| [1, 32, 112, 112] | [16, 32, 1, 1]    |      1 |      1 |       |   4347 |   2274 |   4348 |   4441 |
| [1, 16, 112, 112] | [96, 16, 1, 1]    |      1 |      1 |       |  14352 |  10947 |  14221 |  12739 |
| [1, 96, 112, 112] | [96, 1, 3, 3]     |     96 |      2 |       |  20202 |  30038 |  22218 |  33296 |
| [1, 96, 56, 56]   | [24, 96, 1, 1]    |      1 |      1 |       |   4620 |   5485 |   4595 |   7585 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7526 |   7909 |   7538 |   3772 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      1 | Yes   |   8870 |  13318 |   9295 |   7035 |
| [1, 144, 56, 56]  | [24, 144, 1, 1]   |      1 |      1 |       |   6823 |   6430 |   6795 |   7067 |
| [1, 24, 56, 56]   | [144, 24, 1, 1]   |      1 |      1 |       |   7561 |  12839 |   7528 |   9646 |
| [1, 144, 56, 56]  | [144, 1, 3, 3]    |    144 |      2 |       |   8879 |  13371 |   9774 |  14753 |
| [1, 144, 28, 28]  | [32, 144, 1, 1]   |      1 |      1 |       |   2297 |   1408 |   2282 |   1926 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3287 |   7740 |   3256 |   1873 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3245 |   4892 |   3294 |   1440 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3032 |   4494 |   3010 |   6376 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   3227 |   3277 |   1780 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      1 | Yes   |   3237 |   4912 |   3284 |   1500 |
| [1, 192, 28, 28]  | [32, 192, 1, 1]   |      1 |      1 |       |   3060 |   3538 |   3008 |   5658 |
| [1, 32, 28, 28]   | [192, 32, 1, 1]   |      1 |      1 |       |   3273 |   6590 |   3281 |   1466 |
| [1, 192, 28, 28]  | [192, 1, 3, 3]    |    192 |      2 |       |   6202 |   9395 |   6580 |  10155 |
| [1, 192, 14, 14]  | [64, 192, 1, 1]   |      1 |      1 |       |   1568 |   2030 |   1562 |   1560 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3188 |   5217 |   3168 |   1501 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1889 |   2867 |   1997 |    776 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3053 |   1441 |   3053 |   1547 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3182 |   1188 |   3183 |   1512 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1907 |   2860 |   1992 |    794 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3046 |   1168 |   3037 |   1474 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3225 |   1190 |   3186 |   1599 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1899 |   2856 |   2014 |    773 |
| [1, 384, 14, 14]  | [64, 384, 1, 1]   |      1 |      1 |       |   3042 |   1169 |   3050 |   1494 |
| [1, 64, 14, 14]   | [384, 64, 1, 1]   |      1 |      1 |       |   3209 |   1163 |   3191 |   1507 |
| [1, 384, 14, 14]  | [384, 1, 3, 3]    |    384 |      1 | Yes   |   1879 |   2832 |   1994 |    774 |
| [1, 384, 14, 14]  | [96, 384, 1, 1]   |      1 |      1 |       |   4596 |   2023 |   4606 |   2310 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7081 |   2429 |   7012 |   3086 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2828 |   4269 |   3010 |   1145 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   6989 |   2897 |   6973 |   3491 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7084 |   2446 |   7078 |   3106 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      1 | Yes   |   2853 |   4281 |   3006 |   1405 |
| [1, 576, 14, 14]  | [96, 576, 1, 1]   |      1 |      1 |       |   7003 |   7734 |   6995 |   3575 |
| [1, 96, 14, 14]   | [576, 96, 1, 1]   |      1 |      1 |       |   7082 |   2514 |   7059 |   3032 |
| [1, 576, 14, 14]  | [576, 1, 3, 3]    |    576 |      2 |       |  14518 |  21718 |  14985 |  22810 |
| [1, 576, 7, 7]    | [160, 576, 1, 1]  |      1 |      1 |       |   3043 |   7370 |   3034 |   1951 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5164 |   5660 |   5151 |   2252 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2722 |   1900 |    735 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4935 |   1943 |   4943 |   2134 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5140 |   1817 |   5192 |   2325 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1786 |   2732 |   1871 |    725 |
| [1, 960, 7, 7]    | [160, 960, 1, 1]  |      1 |      1 |       |   4919 |   1881 |   4916 |   2202 |
| [1, 160, 7, 7]    | [960, 160, 1, 1]  |      1 |      1 |       |   5201 |   1816 |   5186 |   2322 |
| [1, 960, 7, 7]    | [960, 1, 3, 3]    |    960 |      1 | Yes   |   1808 |   2718 |   1871 |    726 |
| [1, 960, 7, 7]    | [320, 960, 1, 1]  |      1 |      1 |       |   9883 |   3567 |   9860 |   4303 |
| [1, 320, 7, 7]    | [1280, 320, 1, 1] |      1 |      1 |       |  13407 |   4789 |  13384 |   5857 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
|  Total                                                          | 277112 | 284230 | 282588 | 231535 |
+-------------------+-------------------+--------+--------+-------+--------+--------+--------+--------+
```

- The data confirmed that:
** at::parallel_for() accelerated inference in multi-threaded mode (it was only integrated into the winograd kernel);
** at::parallel_for() didn't add much overhead in single-threaded mode (winograd rows, before vs. after);
** 1x1 kernels use NNPACK, so multi-threaded mode was already faster for them in most cases;
** depthwise strided kernels don't go through the winograd kernel, so multi-threaded mode was still slower for them;

- Verified the inference output was the same across all combinations.

- Ran tests for the non-mobile native threading backend (ATEN_THREADING=NATIVE):
```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```
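The group-level dispatch measured above can be pictured with a self-contained sketch. This is not the actual ATen or caffe2::ThreadPool code — `toy_parallel_for` and `sum_groups` are hypothetical stand-ins that model the "submit one batch, wait for all" semantics described in the summary, using plain std::thread:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Toy model of at::parallel_for(): split [begin, end) into chunks of at
// least grain_size elements, launch the whole batch of chunks at once,
// and block until every chunk finishes -- mirroring the batch-submit
// behavior described for caffe2::ThreadPool.
void toy_parallel_for(int64_t begin, int64_t end, int64_t grain_size,
                      const std::function<void(int64_t, int64_t)>& fn) {
  const int64_t range = end - begin;
  if (range <= 0) return;
  const int64_t hw = std::max(1u, std::thread::hardware_concurrency());
  const int64_t chunks = std::min<int64_t>(
      hw, std::max<int64_t>(1, range / std::max<int64_t>(1, grain_size)));
  const int64_t chunk_size = (range + chunks - 1) / chunks;  // ceil division
  std::vector<std::thread> batch;
  for (int64_t lo = begin; lo < end; lo += chunk_size) {
    const int64_t hi = std::min(end, lo + chunk_size);
    batch.emplace_back([lo, hi, &fn] { fn(lo, hi); });
  }
  for (auto& t : batch) t.join();  // the "pool" is busy until the batch drains
}

// Hypothetical per-group dispatch: stands in for running the depthwise
// 3x3 winograd kernel once per group, as in the patched PR #26556.
int64_t sum_groups(int64_t groups) {
  std::atomic<int64_t> acc{0};
  toy_parallel_for(0, groups, /*grain_size=*/1, [&acc](int64_t lo, int64_t hi) {
    for (int64_t g = lo; g < hi; ++g) {
      acc += g;  // stand-in for the real per-group kernel work
    }
  });
  return acc.load();
}
```

For the 32-group layer in the table, `sum_groups(32)` simply checks that every group index was visited exactly once (0 + 1 + … + 31 = 496), regardless of how the range was chunked.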

Differential Revision: [D17543412](https://our.internmc.facebook.com/intern/diff/D17543412)

[ghstack-poisoned]
@ljk53
Contributor Author

ljk53 commented Sep 26, 2019

Landed on master

@ljk53 ljk53 closed this Sep 26, 2019
ljk53 added a commit that referenced this pull request Sep 27, 2019
…ementation to cpp

Summary:
The template code got instantiated in many places. This PR extracts the common
implementation that doesn't depend on the template parameter into a .cpp file.

After:
Compressed ARMv7 AAR size: 5,677,469->5,398,011
RAW libpytorch.so size: 16,862,108->16,047,004

Test Plan:
- Tested perf/correctness as in #26702;

- Run tests for non-mobile native aten_threading:
```
ATEN_THREADING=NATIVE python setup.py develop --cmake
pytest -s -v test/test_torch.py::TestTorch
pytest -s -v test/test_jit.py
```

[ghstack-poisoned]
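The binary-size win in this commit comes from the usual "thin template over a type-erased core" outlining pattern. A generic sketch with hypothetical names (`internal_run`, `thin_parallel_for`), not the actual ParallelNative code: the heavy scheduling logic compiles once in a .cpp behind std::function, while the header template shrinks to a shim each instantiation can inline cheaply.

```cpp
#include <cstdint>
#include <functional>

// Lives in a single .cpp: the bulky logic that does not depend on the
// template parameter is compiled exactly once for the whole binary.
// (Chunking/thread dispatch elided; serial execution shown for brevity.)
void internal_run(int64_t begin, int64_t end, int64_t grain_size,
                  const std::function<void(int64_t, int64_t)>& body) {
  (void)grain_size;  // unused in this serial sketch
  if (begin < end) {
    body(begin, end);
  }
}

// Lives in the header: each instantiation now emits only this thin
// type-erasing adaptor instead of a full copy of the scheduler.
template <typename F>
void thin_parallel_for(int64_t begin, int64_t end, int64_t grain_size,
                       const F& f) {
  internal_run(begin, end, grain_size,
               [&f](int64_t lo, int64_t hi) { f(lo, hi); });
}
```

Every call site still gets a statically typed `thin_parallel_for<F>`, but the per-instantiation code is a few instructions, which is consistent with the AAR/.so size drop reported above.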
ljk53 added a commit that referenced this pull request Sep 27, 2019
…ementation to cpp

ghstack-source-id: 01dd2d2
Pull Request resolved: #26969
ljk53 added a commit that referenced this pull request Sep 28, 2019
…common implementation to cpp"

ljk53 added a commit that referenced this pull request Sep 28, 2019
…common implementation to cpp"

ljk53 added a commit that referenced this pull request Sep 28, 2019
…ementation to cpp

ghstack-source-id: 64b91e0
Pull Request resolved: #26969
facebook-github-bot pushed a commit that referenced this pull request Sep 28, 2019
Summary:
Pull Request resolved: #26969


Differential Revision: D17628089

Pulled By: ljk53

fbshipit-source-id: 987d1f28174870384d6642d0bd4912b138348f66
ljk53 added a commit to ljk53/pytorch that referenced this pull request Sep 28, 2019
…ch#26969)

soumith pushed a commit that referenced this pull request Oct 4, 2019
Summary:
Pull Request resolved: #26969

@facebook-github-bot facebook-github-bot deleted the gh/ljk53/56/head branch October 28, 2019 22:16
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
…ch#26969)

xxtEchjovs44 pushed a commit to xxtEchjovs44/pytorch that referenced this pull request Jan 29, 2020
Labels
caffe2, module: build (Build system issues), module: internals (Related to internal abstractions in c10 and ATen), module: openmp (Related to OpenMP (omp) support in PyTorch)
5 participants