
NHWC memory layout support and XNNPACK integration for mobile #30644

Closed
wants to merge 1 commit into from

Conversation

AshkanAliabadi
Contributor

Gathering CI signal for now ...

@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 13, 2019
@kostmo
Member

kostmo commented Dec 13, 2019

💊 CircleCI build failures summary and remediations

As of commit 69f5e7d:

  • 4/4 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 4 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (1/4)

Step: "Build" (full log | pattern match details)

Dec 30 22:55:47 operator-run.c:(.text+0x19f4): undefined reference to `pthreadpool_compute_2d'
Dec 30 22:55:47 operator-run.c:(.text+0xedf): undefined reference to `pthreadpool_compute_1d' 
Dec 30 22:55:47 operator-run.c:(.text+0xfb5): undefined reference to `pthreadpool_compute_1d' 
Dec 30 22:55:47 operator-run.c:(.text+0x10c9): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 operator-run.c:(.text+0x1359): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 operator-run.c:(.text+0x1435): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x1543): undefined reference to `pthreadpool_compute_4d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x16e3): undefined reference to `pthreadpool_compute_4d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x17d6): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x1834): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x18b3): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x19f4): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 collect2: error: ld returned 1 exit status 
Dec 30 22:55:47 CMakeFiles/main.dir/build.make:102: recipe for target 'main' failed 
Dec 30 22:55:47 make[2]: *** [main] Error 1 
Dec 30 22:55:47 CMakeFiles/Makefile2:104: recipe for target 'CMakeFiles/main.dir/all' failed 
Dec 30 22:55:47 make[1]: *** [CMakeFiles/main.dir/all] Error 2 
Dec 30 22:55:47 Makefile:127: recipe for target 'all' failed 
Dec 30 22:55:47 make: *** [all] Error 2 
Dec 30 22:55:47 + sccache_epilogue 
Dec 30 22:55:47 + echo '=================== sccache compilation log ===================' 
Dec 30 22:55:47 =================== sccache compilation log =================== 

See CircleCI build pytorch_windows_build (2/4)

Step: "Build" (full log | pattern match details)

FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qlinear.cpp.obj
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
N -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\detail\CUDAHooksInterface.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\detail\CUDAHooksInterface.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

-DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qconv.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qconv.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qconv.cpp.obj  
-DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qconv.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qconv.cpp 
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
VE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qlinear.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qlinear.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qlinear.cpp.obj  
VE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qlinear.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qlinear.cpp 
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
p -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\core\Dimname.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\core\Dimname.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

U_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\q_adaavgpool.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\q_adaavgpool.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32 (3/4)

Step: "pytorch android gradle build only x86_32 (for PR)" (full log | pattern match details)

Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build failed with an exception.
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.changedetection.state.CacheBackedTaskHistoryRepository] Fingerprinting property otherFolderOutputs.otherFolderOutput1 for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' 
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.changedetection.state.CacheBackedTaskHistoryRepository] Fingerprinting property streamOutputFolder (Output) for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' 
Dec 30 23:33:55 23:33:43.493 [INFO] [org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter] Task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' is not up-to-date because: 
Dec 30 23:33:55   No history is available. 
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.tasks.execution.OutputDirectoryCreatingTaskExecuter] Ensuring directory exists for property otherFolderOutputs.otherFolderOutput1 at /var/lib/jenkins/workspace/android/libs/fbjni_local/build/intermediates/incremental/release-mergeJniLibs/zip-cache 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.api.internal.tasks.execution.OutputDirectoryCreatingTaskExecuter] Ensuring directory exists for property streamOutputFolder (Output) at /var/lib/jenkins/workspace/android/libs/fbjni_local/build/intermediates/transforms/mergeJniLibs/release 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter] Executing actions for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease'. 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Build operation 'Execute transform for :fbjni:transformNativeLibsWithMergeJniLibsForRelease' started 
Dec 30 23:33:55 23:33:43.494 [INFO] [org.gradle.api.internal.changedetection.changes.RebuildIncrementalTaskInputs] All input files are considered out-of-date for incremental task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease'. 
Dec 30 23:33:55 23:33:43.522 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Completing Build operation 'Execute transform for :fbjni:transformNativeLibsWithMergeJniLi23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build failed with an exception. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * What went wrong: 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Execution failed for task ':pytorch_android:externalNativeBuildRelease'. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] > Build command failed. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]   Error while executing process /usr/local/bin/cmake with arguments {--build /var/lib/jenkins/workspace/android/pytorch_android/.externalNativeBuild/cmake/release/x86 --target pytorch_jni} 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]    
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]   ninja: error: '../../../../src/main/jniLibs/x86/libnnpack.a', needed by '../../../../build/intermediates/cmake/release/obj/x86/libpytorch_jni.so', missing and no known rule to make it 
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * Try: 

See CircleCI build pytorch_linux_backward_compatibility_check_test (4/4)

Step: "Test" (full log | pattern match details)

Dec 30 23:43:17 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(int, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(float, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(Tensor, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::view(Tensor(a) self, int[] size) -> (Tensor(a)) 
Dec 30 23:43:17 processing existing schema:  aten::where(Tensor condition) -> (Tensor[]) 
Dec 30 23:43:17 processing existing schema:  aten::where.self(Tensor condition, Tensor self, Tensor other) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  aten::zeros_like(Tensor self, *, int? memory_format=None) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  aten::zeros_like.dtype(Tensor self, *, int dtype, int layout, Device device, bool pin_memory=False, int? memory_format=None) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  quantized::add(Tensor qa, Tensor qb, float scale, int zero_point) -> (Tensor qc) 
Dec 30 23:43:17 processing existing schema:  quantized::add_relu(Tensor qa, Tensor qb, float scale, int zero_point) -> (Tensor qc) 
Dec 30 23:43:17 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Dec 30 23:43:17  
Dec 30 23:43:17 Broken ops: [ 
Dec 30 23:43:17 	aten::_nnpack_available() -> (bool) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward_input(Tensor input, Tensor grad_output, Tensor weight, int[2] padding) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution(Tensor input, Tensor weight, Tensor? bias, int[2] padding, int[2] stride=[1, 1]) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward_weight(Tensor input, int[] weightsize, Tensor grad_output, int[2] padding) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward(Tensor input, Tensor grad_output, Tensor weight, int[2] padding, bool[3] output_mask) -> (Tensor, Tensor, Tensor) 
Dec 30 23:43:17 ] 
Dec 30 23:43:17 =================== sccache compilation log =================== 
Dec 30 23:43:17 + cleanup 

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 177 times.

@AshkanAliabadi AshkanAliabadi changed the title WIP NHWC memory layout support and XNNPACK integration for mobile Dec 22, 2019
@AshkanAliabadi
Contributor Author

Please review this code, or the areas you are interested in. Any help is appreciated.

Contributor Author

@AshkanAliabadi AshkanAliabadi left a comment


Publishing to collect feedback.

@@ -5,8 +5,7 @@
#ifndef C10_MOBILE
Contributor Author

@ilia-cher Please take a look.

fn(0, i);
}
}
native::mobile::internal::threadpool().run(fn, range);
Contributor Author

@ilia-cher What would be the repercussions of not passing the thread ID to the callback? pthreadpool, the lower-level threading library we use on mobile, does not support that feature, and I am wondering whether it matters enough to go through the trouble of supporting that use case.

Contributor

I think there is an internal API to query thread-ID: get_thread_num() https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Parallel.h#L30
Do you refer to it or some other API?

Here is one sample call-site of get_thread_num() API in aten: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIteratorReduce.cpp#L34
I think it is used to perform some ad-hoc reduction - not sure if it's actually needed by any mobile model for real.

I remember that right now we are using a thread-local to mimic this behavior - I'm not sure it needs to depend on a specific thread pool implementation:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L105

Contributor

Took a closer look - it seems you are referring to the first param (thread_pool_task_id) of the fn callback of _run_with_pool. I think this callback is only used in this file by _parallel_run, and the first param is marked as unused there: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L135

If we move to using pthreadpool, which doesn't pass a thread-id to fn, then we can simply remove it. It doesn't seem to be in the public API of the parallel_for callback signature (which handles a task_id instead of a thread_id).
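A minimal sketch of the thread-local approach mentioned above, assuming a toy pool - none of these names (sketch::run_1d, sketch::get_thread_num) are the actual ATen or pthreadpool symbols. When the pool's callback only receives a task index, as with pthreadpool_compute_1d, each worker records its own index in a thread_local slot so that code which still calls get_thread_num() keeps working:

```cpp
// Illustrative only: the pool and functions here are hypothetical.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

namespace sketch {

thread_local int tls_thread_id = 0;              // what get_thread_num() reads
int get_thread_num() { return tls_thread_id; }

// Runs fn(task) over [0, range); the callback carries only the task index,
// mirroring pthreadpool's 1-D compute API.
void run_1d(const std::function<void(size_t)>& fn, size_t range) {
  const unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
  std::atomic<size_t> next{0};
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w] {
      tls_thread_id = static_cast<int>(w);       // per-worker id, set once
      for (size_t task = next.fetch_add(1); task < range;
           task = next.fetch_add(1)) {
        fn(task);                                // fn never sees a thread id
      }
    });
  }
  for (auto& t : workers) t.join();
}

}  // namespace sketch

int main() {
  sketch::run_1d(
      [](size_t task) {
        std::printf("task %zu ran on worker %d\n", task, sketch::get_thread_num());
      },
      8);
  return 0;
}
```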

@@ -24,14 +25,26 @@ DEFINE_DISPATCH(leaky_relu_stub);
DEFINE_DISPATCH(leaky_relu_backward_stub);

Tensor hardtanh(const Tensor& self, Scalar min, Scalar max) {
// if (mobile::cpu::use_clamp(self, min, max)) {
// return mobile::cpu::clamp(self, min, max);
// }
Contributor Author

Will enable all of these, and all the commented-out lines below in this diff.

// output = mobile::cpu::convolution(
// input, weight, bias,
// params.padding, params.stride, params.dilation, params.groups, params.transposed);

Contributor Author

The mobile hook for convolutions. This code path is not ideal, since portions of the convolution can be computed once and cached - a major source of performance uplift - and this code path cannot do that. Ideally we want to use the newly introduced modules at the end of this diff, or even better, the JIT.

for non-dilated case here */
return at::thnn_conv2d(
input, weight, kernel_size, bias,
stride, padding);
Contributor Author

NNPACK is gone. I left it there for Caffe2, but XNNPACK is still faster even if used inefficiently via the commented-out code path above.

// factored out resulting in time savings per calls to forward() - something
// this API cannot do. Furthermore, this API does not allow for fusion of
// non-linear operators that, again, is something that the exposed c10 mobile
// operators can handle.
Contributor Author

This explains why the commented-out code paths above are not the most efficient use of our resources.
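For illustration, a minimal sketch of the create/run split being argued for here; the names (PrepackedConv2d, prepack, PackedWeights) are hypothetical stand-ins, not the actual mobile::cpu API. The input-independent work (weight packing) happens once when the operator is created, and each forward() call only pays for run():

```cpp
// Hypothetical names; not the mobile::cpu or XNNPACK API.
#include <utility>
#include <vector>

namespace sketch {

struct PackedWeights {
  std::vector<float> data;  // stand-in for a packed-weights handle
};

inline PackedWeights prepack(std::vector<float> weight) {
  // Real code would reorder the weights into the kernel's preferred layout.
  return PackedWeights{std::move(weight)};
}

class PrepackedConv2d {
 public:
  explicit PrepackedConv2d(std::vector<float> weight)
      : packed_(prepack(std::move(weight))) {}  // one-time cost at creation

  // Per-call cost only: reuses packed_ instead of repacking every forward().
  std::vector<float> run(const std::vector<float>& input) const {
    std::vector<float> output(input.size());
    // ... dispatch to the packed kernel here ...
    return output;
  }

 private:
  PackedWeights packed_;
};

}  // namespace sketch
```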

// TODO (Ashkan)

ThreadPool& threadpool() {
static ThreadPool threadpool_(4u);
Contributor Author

Will have to fix this.
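As one possible direction (purely a sketch; default_thread_count and this ThreadPool are illustrative, not the actual mobile threadpool), the hard-coded 4u above could be replaced with a size derived from the host, with a floor of one:

```cpp
// Illustrative sketch only; names are hypothetical.
#include <algorithm>
#include <thread>

namespace sketch {

inline unsigned default_thread_count() {
  // hardware_concurrency() may return 0 when it cannot be determined.
  return std::max(1u, std::thread::hardware_concurrency());
}

struct ThreadPool {
  explicit ThreadPool(unsigned threads) : size(threads) {}
  unsigned size;
};

inline ThreadPool& threadpool() {
  // Meyers singleton, sized from the host instead of a fixed constant.
  static ThreadPool pool(default_thread_count());
  return pool;
}

}  // namespace sketch
```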

@@ -1,12 +1,12 @@
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
from torch.nn import Conv2d, Conv3d, ReLU, Linear, BatchNorm2d
from torch.nn import Conv2d, Conv3d, ReLU, ReLU6, Linear, BatchNorm2d
Contributor Author


import torchvision

def mobilenetv2():
Contributor Author

@ljk53 @suo @dzhulgakov Having separate mobile-specific models is not ideal. Ideally we want to perform these transformations in a JIT pass, but I am not sure how to proceed here. Should we move forward with this temporarily, or merge the underlying implementation without "exposing" it through these modules for now? The old code path is not going to be very efficient, as mentioned in the comments above.

Contributor

As you correctly pointed out, let's split this PR into smaller ones. I think the Python frontend change can be separated out (as you mentioned, we are still deciding between this and a JIT pass...)

@@ -87,6 +87,7 @@ def fuse_known_modules(mod_list):
OP_LIST_TO_FUSER_METHOD = {
(torch.nn.Conv2d, torch.nn.BatchNorm2d): fuse_conv_bn,
(torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU): fuse_conv_bn_relu,
(torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU6): fuse_conv_bn_relu,
Contributor Author

@@ -212,9 +203,7 @@ int get_num_threads() {
return _get_intraop_pool().size() + 1;
}
#else
caffe2::ThreadPool* pool = caffe2::mobile_threadpool();
// caffe2::ThreadPool::getNumThreads() counts the current thread.
return !pool || in_parallel_region() ? 1 /* current thread */ : pool->getNumThreads();
Contributor Author

@ilia-cher Can you explain what in_parallel_region() is for? Is it to avoid over-parallelization when invoked from a parallel context? This check was added to avoid a deadlock while we were using caffe2::ThreadPool, if I remember correctly, but now that we are moving away from that, is it still needed? cc @ljk53

Contributor

I remember that getNumThreads() acquires a mutex that is not reentrant-safe, and there are call-sites in the codebase that call getNumThreads() from a thread in the threadpool, which can cause a deadlock. I remember @xta0 both added this check here and removed the mutex from the pool (since the thread-count variable is not updated after initialization, a race condition is very unlikely). We can check whether the new get_thread_count() method needs this or not.
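A small sketch of the pattern described above, under the assumption that the thread count is fixed after construction (NaivePool is hypothetical, not caffe2::ThreadPool): size() reads a plain immutable member instead of taking the pool's mutex, so calling it from a worker thread while the pool holds its dispatch mutex can no longer deadlock.

```cpp
// Hypothetical pool; not the caffe2::ThreadPool implementation.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class NaivePool {
 public:
  explicit NaivePool(int n) : num_threads_(n) {}

  // Lock-free: safe to call from any thread, including the pool's own workers.
  int size() const { return num_threads_; }

  // The mutex only protects dispatch, not size().
  template <typename F>
  void run(F&& task) {
    std::lock_guard<std::mutex> guard(dispatch_mutex_);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads_; ++i) {
      workers.emplace_back([&] { task(*this); });  // task may call size()
    }
    for (auto& w : workers) w.join();  // joining while holding dispatch_mutex_ is now safe
  }

 private:
  const int num_threads_;   // immutable after construction
  std::mutex dispatch_mutex_;
};

int main() {
  NaivePool pool(4);
  pool.run([](const NaivePool& p) { std::printf("pool size: %d\n", p.size()); });
  return 0;
}
```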

@AshkanAliabadi
Contributor Author

I think one thing I should do that will make it easier on the reviewers is to break this PR into several smaller ones. :) That way we can merge the core, non-controversial parts of the implementation that will not affect anyone, and then debate the best ways to enable it and expose it to the user.

@Maratyszcza
Contributor

While this PR seems to address PyTorch Mobile needs, XNNPACK is not limited to mobile platforms: it supports Web inference through WebAssembly and WebAssembly SIMD micro-kernels, and it includes a decent and quickly improving set of micro-kernels for server-class x86-64 processors, i.e. targeting AVX/AVX2/AVX512 features. I'd suggest not restricting its integration to just mobile platforms.

@@ -1,47 +0,0 @@
#include "caffe2/utils/threadpool/pthreadpool.h"
Contributor

I'm under the impression that this caffe2 pthreadpool glue code is still used in some FB production, since we haven't fully migrated to PyTorch yet - is that correct? In that case we cannot immediately delete these files...



}

const auto registry = c10::RegisterOperators()
.op("mobile::conv2d_create",
Contributor

Any chance we can move op registration to native_functions.yaml? We introduced manual op registration for quantized ops, but people were debating whether that is the right direction to go. I think the current suggestion is still to stick to native_functions.yaml & codegen? cc: @dzhulgakov @gchanan @smessmer

Contributor

native_functions.yaml doesn't support namespaces like mobile:: yet, but generally yes, it's usually better to define things in native_functions.yaml because you get the C++ frontend generated.



@AshkanAliabadi
Contributor Author

AshkanAliabadi commented Jan 23, 2020

Breaking this PR into smaller chunks to enable a more controlled roll-out, starting with #32509. Closing this PR.

facebook-github-bot pushed a commit that referenced this pull request Feb 24, 2020
Summary:
In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but it will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding **native** implementations, provided that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance XNNPACK enables, as it does not provide a way to compute one-time operations once and factor them out of the innermost forward() loop.

The better solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that call directly into these implementations, either through c10 or some other mechanism, also allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen and make the codebase a little simpler, provided that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
Pull Request resolved: #32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
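As a rough illustration of the staging the summary above describes (all function names here are hypothetical, not the actual ATen symbols), the XNNPACK-backed path is compiled in only when the build defines USE_XNNPACK, and callers otherwise fall through to the existing implementation:

```cpp
// Sketch of a USE_XNNPACK-guarded dispatch; hypothetical names throughout.
#include <vector>

std::vector<float> reference_convolution(const std::vector<float>& input) {
  return input;  // stand-in for the existing (non-XNNPACK) kernel
}

#ifdef USE_XNNPACK
bool xnnpack_convolution_usable(const std::vector<float>& input) {
  return !input.empty();  // real code would check layout, dtype, params, etc.
}
std::vector<float> xnnpack_convolution(const std::vector<float>& input) {
  return input;  // stand-in for the NHWC / XNNPACK-backed kernel
}
#endif

std::vector<float> convolution(const std::vector<float>& input) {
#ifdef USE_XNNPACK
  if (xnnpack_convolution_usable(input)) {
    return xnnpack_convolution(input);
  }
#endif
  return reference_convolution(input);  // default path when not gated in
}
```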
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2020
Summary:
Pull Request resolved: #33722

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but it will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, provided that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance XNNPACK enables, as it does not provide a way to compute one-time operations once and factor them out of the innermost forward() loop.

The better solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that call directly into these implementations, either through c10 or some other mechanism, also allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen and make the codebase a little simpler, provided that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: #32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
hczhu pushed a commit that referenced this pull request Feb 28, 2020
hczhu pushed a commit that referenced this pull request Feb 28, 2020
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020