
NHWC memory layout support and XNNPACK integration for mobile #30644

Closed
wants to merge 1 commit into from

Conversation

AshkanAliabadi
Contributor

Gathering CI signal for now ...

@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 12, 2019
@pytorch pytorch deleted a comment from kostmo Dec 13, 2019
@kostmo
Member

kostmo commented Dec 13, 2019

💊 CircleCI build failures summary and remediations

As of commit 69f5e7d:

  • 4/4 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 4 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_linux_xenial_py3_clang5_mobile_build (1/4)

Step: "Build" (full log | pattern match details)

Dec 30 22:55:47 operator-run.c:(.text+0x19f4): undefined reference to `pthreadpool_compute_2d'
Dec 30 22:55:47 operator-run.c:(.text+0xedf): undefined reference to `pthreadpool_compute_1d' 
Dec 30 22:55:47 operator-run.c:(.text+0xfb5): undefined reference to `pthreadpool_compute_1d' 
Dec 30 22:55:47 operator-run.c:(.text+0x10c9): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 operator-run.c:(.text+0x1359): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 operator-run.c:(.text+0x1435): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x1543): undefined reference to `pthreadpool_compute_4d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x16e3): undefined reference to `pthreadpool_compute_4d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x17d6): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x1834): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x18b3): undefined reference to `pthreadpool_compute_1d_tiled' 
Dec 30 22:55:47 operator-run.c:(.text+0x19f4): undefined reference to `pthreadpool_compute_2d' 
Dec 30 22:55:47 collect2: error: ld returned 1 exit status 
Dec 30 22:55:47 CMakeFiles/main.dir/build.make:102: recipe for target 'main' failed 
Dec 30 22:55:47 make[2]: *** [main] Error 1 
Dec 30 22:55:47 CMakeFiles/Makefile2:104: recipe for target 'CMakeFiles/main.dir/all' failed 
Dec 30 22:55:47 make[1]: *** [CMakeFiles/main.dir/all] Error 2 
Dec 30 22:55:47 Makefile:127: recipe for target 'all' failed 
Dec 30 22:55:47 make: *** [all] Error 2 
Dec 30 22:55:47 + sccache_epilogue 
Dec 30 22:55:47 + echo '=================== sccache compilation log ===================' 
Dec 30 22:55:47 =================== sccache compilation log =================== 

See CircleCI build pytorch_windows_build (2/4)

Step: "Build" (full log | pattern match details)

FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qlinear.cpp.obj
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
N -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\detail\CUDAHooksInterface.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\detail\CUDAHooksInterface.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

-DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qconv.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qconv.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qconv.cpp.obj  
-DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qconv.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qconv.cpp 
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
VE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qlinear.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qlinear.cpp 
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/quantized/cpu/qlinear.cpp.obj  
VE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\qlinear.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\qlinear.cpp 
C:\Users\circleci\project\aten\src\ATen/native/mobile/internal/ThreadPool.h(4): fatal error C1083: Cannot open include file: 'pthreadpool.h': No such file or directory
p -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\core\Dimname.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\core\Dimname.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

U_DEFINITION /MD /O2 /Ob2  /w /EHa /MP /bigobj   -DCUDA_HAS_FP16=1 -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD /Z7 /EHa /DNOMINMAX /wd4267 /wd4251 /wd4522 /wd4838 /wd4305 /wd4244 /wd4190 /wd4101 /wd4996 /wd4275 /bigobj -O2 -DCAFFE2_BUILD_MAIN_LIB -DONNX_BUILD_MAIN_LIB -std:c++14 /showIncludes /Focaffe2\CMakeFiles\torch_cpu.dir\__\aten\src\ATen\native\quantized\cpu\q_adaavgpool.cpp.obj /Fdcaffe2\CMakeFiles\torch_cpu.dir\ /FS -c ..\aten\src\ATen\native\quantized\cpu\q_adaavgpool.cpp 
Microsoft (R) C/C++ Optimizing Compiler Version 19.11.25548.2 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

See CircleCI build pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-x86_32 (3/4)

Step: "pytorch android gradle build only x86_32 (for PR)" (full log | pattern match details)

Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build failed with an exception.
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.changedetection.state.CacheBackedTaskHistoryRepository] Fingerprinting property otherFolderOutputs.otherFolderOutput1 for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' 
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.changedetection.state.CacheBackedTaskHistoryRepository] Fingerprinting property streamOutputFolder (Output) for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' 
Dec 30 23:33:55 23:33:43.493 [INFO] [org.gradle.api.internal.tasks.execution.SkipUpToDateTaskExecuter] Task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease' is not up-to-date because: 
Dec 30 23:33:55   No history is available. 
Dec 30 23:33:55 23:33:43.493 [DEBUG] [org.gradle.api.internal.tasks.execution.OutputDirectoryCreatingTaskExecuter] Ensuring directory exists for property otherFolderOutputs.otherFolderOutput1 at /var/lib/jenkins/workspace/android/libs/fbjni_local/build/intermediates/incremental/release-mergeJniLibs/zip-cache 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.api.internal.tasks.execution.OutputDirectoryCreatingTaskExecuter] Ensuring directory exists for property streamOutputFolder (Output) at /var/lib/jenkins/workspace/android/libs/fbjni_local/build/intermediates/transforms/mergeJniLibs/release 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteActionsTaskExecuter] Executing actions for task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease'. 
Dec 30 23:33:55 23:33:43.494 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Build operation 'Execute transform for :fbjni:transformNativeLibsWithMergeJniLibsForRelease' started 
Dec 30 23:33:55 23:33:43.494 [INFO] [org.gradle.api.internal.changedetection.changes.RebuildIncrementalTaskInputs] All input files are considered out-of-date for incremental task ':fbjni:transformNativeLibsWithMergeJniLibsForRelease'. 
Dec 30 23:33:55 23:33:43.522 [DEBUG] [org.gradle.internal.operations.DefaultBuildOperationExecutor] Completing Build operation 'Execute transform for :fbjni:transformNativeLibsWithMergeJniLi23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] FAILURE: Build failed with an exception. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * What went wrong: 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] Execution failed for task ':pytorch_android:externalNativeBuildRelease'. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] > Build command failed. 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]   Error while executing process /usr/local/bin/cmake with arguments {--build /var/lib/jenkins/workspace/android/pytorch_android/.externalNativeBuild/cmake/release/x86 --target pytorch_jni} 
Dec 30 23:33:55 23:33:51.289 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]    
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]   ninja: error: '../../../../src/main/jniLibs/x86/libnnpack.a', needed by '../../../../build/intermediates/cmake/release/obj/x86/libpytorch_jni.so', missing and no known rule to make it 
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter]  
Dec 30 23:33:55 23:33:51.290 [ERROR] [org.gradle.internal.buildevents.BuildExceptionReporter] * Try: 

See CircleCI build pytorch_linux_backward_compatibility_check_test (4/4)

Step: "Test" (full log | pattern match details)

Dec 30 23:43:17 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(int, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(float, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::values(Dict(Tensor, t) self) -> (t[](*)) 
Dec 30 23:43:17 processing existing schema:  aten::view(Tensor(a) self, int[] size) -> (Tensor(a)) 
Dec 30 23:43:17 processing existing schema:  aten::where(Tensor condition) -> (Tensor[]) 
Dec 30 23:43:17 processing existing schema:  aten::where.self(Tensor condition, Tensor self, Tensor other) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  aten::zeros_like(Tensor self, *, int? memory_format=None) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  aten::zeros_like.dtype(Tensor self, *, int dtype, int layout, Device device, bool pin_memory=False, int? memory_format=None) -> (Tensor) 
Dec 30 23:43:17 processing existing schema:  quantized::add(Tensor qa, Tensor qb, float scale, int zero_point) -> (Tensor qc) 
Dec 30 23:43:17 processing existing schema:  quantized::add_relu(Tensor qa, Tensor qb, float scale, int zero_point) -> (Tensor qc) 
Dec 30 23:43:17 The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not.  
Dec 30 23:43:17  
Dec 30 23:43:17 Broken ops: [ 
Dec 30 23:43:17 	aten::_nnpack_available() -> (bool) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward_input(Tensor input, Tensor grad_output, Tensor weight, int[2] padding) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution(Tensor input, Tensor weight, Tensor? bias, int[2] padding, int[2] stride=[1, 1]) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward_weight(Tensor input, int[] weightsize, Tensor grad_output, int[2] padding) -> (Tensor) 
Dec 30 23:43:17 	aten::_nnpack_spatial_convolution_backward(Tensor input, Tensor grad_output, Tensor weight, int[2] padding, bool[3] output_mask) -> (Tensor, Tensor, Tensor) 
Dec 30 23:43:17 ] 
Dec 30 23:43:17 =================== sccache compilation log =================== 
Dec 30 23:43:17 + cleanup 

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 177 times.

@AshkanAliabadi AshkanAliabadi changed the title WIP NHWC memory layout support and XNNPACK integration for mobile Dec 22, 2019
@AshkanAliabadi
Contributor Author

Please review this code, or the areas you are interested in. Any help is appreciated.

Contributor Author

@AshkanAliabadi AshkanAliabadi left a comment


Publishing to collect feedback.

@@ -5,8 +5,7 @@
#ifndef C10_MOBILE
Contributor Author

@ilia-cher Please take a look.

fn(0, i);
}
}
native::mobile::internal::threadpool().run(fn, range);
Contributor Author

@ilia-cher What would be the repercussions of not passing the thread ID to the callback? pthreadpool, the lower-level threading library we use on mobile, does not support that feature, and I am wondering whether it matters enough to go through the trouble of supporting that use case.

Contributor

I think there is an internal API to query thread-ID: get_thread_num() https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Parallel.h#L30
Do you refer to it or some other API?

Here is one sample call-site of get_thread_num() API in aten: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIteratorReduce.cpp#L34
I think it is used to perform some ad-hoc reduction - not sure if it's actually needed by any mobile model for real.

I remember that right now we are using a thread-local to mimic this behavior - I'm not sure it needs to depend on a specific thread pool implementation:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L105

Contributor

Took a closer look - it seems you are referring to the first param (thread_pool_task_id) of the fn callback of _run_with_pool. I think this callback is only used in this file by _parallel_run, and the first param is marked as unused there: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L135

If we move to using pthreadpool, which doesn't pass a thread-id to fn, then we can simply remove it. It doesn't seem to be in the public API of the parallel_for callback signature (which handles a task_id instead of a thread_id).
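A minimal sketch of the thread-local approach mentioned above, assuming a toy pool - none of these names (sketch::run_1d, sketch::get_thread_num) are the actual ATen or pthreadpool symbols. When the pool's callback only receives a task index, as with pthreadpool_compute_1d, each worker records its own index in a thread_local slot so that code which still calls get_thread_num() keeps working:

```cpp
// Illustrative only: the pool and functions here are hypothetical.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

namespace sketch {

thread_local int tls_thread_id = 0;              // what get_thread_num() reads
int get_thread_num() { return tls_thread_id; }

// Runs fn(task) over [0, range); the callback carries only the task index,
// mirroring pthreadpool's 1-D compute API.
void run_1d(const std::function<void(size_t)>& fn, size_t range) {
  const unsigned num_workers = std::max(1u, std::thread::hardware_concurrency());
  std::atomic<size_t> next{0};
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < num_workers; ++w) {
    workers.emplace_back([&, w] {
      tls_thread_id = static_cast<int>(w);       // per-worker id, set once
      for (size_t task = next.fetch_add(1); task < range;
           task = next.fetch_add(1)) {
        fn(task);                                // fn never sees a thread id
      }
    });
  }
  for (auto& t : workers) t.join();
}

}  // namespace sketch

int main() {
  sketch::run_1d(
      [](size_t task) {
        std::printf("task %zu ran on worker %d\n", task, sketch::get_thread_num());
      },
      8);
  return 0;
}
```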

@@ -24,14 +25,26 @@ DEFINE_DISPATCH(leaky_relu_stub);
DEFINE_DISPATCH(leaky_relu_backward_stub);

Tensor hardtanh(const Tensor& self, Scalar min, Scalar max) {
// if (mobile::cpu::use_clamp(self, min, max)) {
// return mobile::cpu::clamp(self, min, max);
// }
Contributor Author

Will enable all of these, and all the commented-out lines below in this diff.

// output = mobile::cpu::convolution(
// input, weight, bias,
// params.padding, params.stride, params.dilation, params.groups, params.transposed);

Contributor Author

The mobile hook for convolutions. This code path is not ideal, since portions of the convolution can be computed once and cached - a major source of performance uplift - and this code path cannot do that. Ideally we want to use the newly introduced modules at the end of this diff, or even better, the JIT.

for non-dilated case here */
return at::thnn_conv2d(
input, weight, kernel_size, bias,
stride, padding);
Contributor Author

NNPACK is gone. I left it there for Caffe2, but XNNPACK is still faster even if used inefficiently via the commented-out code path above.

// factored out resulting in time savings per calls to forward() - something
// this API cannot do. Furthermore, this API does not allow for fusion of
// non-linear operators that, again, is something that the exposed c10 mobile
// operators can handle.
Contributor Author

This explains why the commented-out code paths above are not the most efficient use of our resources.
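For illustration, a minimal sketch of the create/run split being argued for here; the names (PrepackedConv2d, prepack, PackedWeights) are hypothetical stand-ins, not the actual mobile::cpu API. The input-independent work (weight packing) happens once when the operator is created, and each forward() call only pays for run():

```cpp
// Hypothetical names; not the mobile::cpu or XNNPACK API.
#include <utility>
#include <vector>

namespace sketch {

struct PackedWeights {
  std::vector<float> data;  // stand-in for a packed-weights handle
};

inline PackedWeights prepack(std::vector<float> weight) {
  // Real code would reorder the weights into the kernel's preferred layout.
  return PackedWeights{std::move(weight)};
}

class PrepackedConv2d {
 public:
  explicit PrepackedConv2d(std::vector<float> weight)
      : packed_(prepack(std::move(weight))) {}  // one-time cost at creation

  // Per-call cost only: reuses packed_ instead of repacking every forward().
  std::vector<float> run(const std::vector<float>& input) const {
    std::vector<float> output(input.size());
    // ... dispatch to the packed kernel here ...
    return output;
  }

 private:
  PackedWeights packed_;
};

}  // namespace sketch
```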

// TODO (Ashkan)

ThreadPool& threadpool() {
static ThreadPool threadpool_(4u);
Contributor Author

Will have to fix this.
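As one possible direction (purely a sketch; default_thread_count and this ThreadPool are illustrative, not the actual mobile threadpool), the hard-coded 4u above could be replaced with a size derived from the host, with a floor of one:

```cpp
// Illustrative sketch only; names are hypothetical.
#include <algorithm>
#include <thread>

namespace sketch {

inline unsigned default_thread_count() {
  // hardware_concurrency() may return 0 when it cannot be determined.
  return std::max(1u, std::thread::hardware_concurrency());
}

struct ThreadPool {
  explicit ThreadPool(unsigned threads) : size(threads) {}
  unsigned size;
};

inline ThreadPool& threadpool() {
  // Meyers singleton, sized from the host instead of a fixed constant.
  static ThreadPool pool(default_thread_count());
  return pool;
}

}  // namespace sketch
```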

@@ -1,12 +1,12 @@
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
from torch.nn import Conv2d, Conv3d, ReLU, Linear, BatchNorm2d
from torch.nn import Conv2d, Conv3d, ReLU, ReLU6, Linear, BatchNorm2d
Contributor Author


import torchvision

def mobilenetv2():
Contributor Author

@ljk53 @suo @dzhulgakov Having separate mobile-specific models is not ideal. Ideally we want to perform these transformations in a JIT pass, but I am not sure how to proceed here. Should we move forward with this temporarily, or merge the underlying implementation without "exposing" it through these modules for now? The old code path is not going to be very efficient, as mentioned in the comments above.

Contributor

As you correctly pointed out, let's split this PR into smaller ones. I think the Python frontend change can be separated out (as you mentioned, we are still deciding between this and a JIT pass...)

@@ -87,6 +87,7 @@ def fuse_known_modules(mod_list):
OP_LIST_TO_FUSER_METHOD = {
(torch.nn.Conv2d, torch.nn.BatchNorm2d): fuse_conv_bn,
(torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU): fuse_conv_bn_relu,
(torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU6): fuse_conv_bn_relu,
Contributor Author

@@ -212,9 +203,7 @@ int get_num_threads() {
return _get_intraop_pool().size() + 1;
}
#else
caffe2::ThreadPool* pool = caffe2::mobile_threadpool();
// caffe2::ThreadPool::getNumThreads() counts the current thread.
return !pool || in_parallel_region() ? 1 /* current thread */ : pool->getNumThreads();
Contributor Author

@ilia-cher Can you explain what in_parallel_region() is for? Is it to avoid over-parallelization when invoked from a parallel context? This check was added to avoid a deadlock while we were using caffe2::ThreadPool, if I remember correctly, but now that we are moving away from that, is it still needed? cc @ljk53

Contributor

I remember that getNumThreads() acquires a mutex that is not reentrant-safe, and there are call-sites in the codebase that call getNumThreads() from a thread in the threadpool, which can cause a deadlock. I remember @xta0 both added this check here and removed the mutex from the pool (since the thread-count variable is not updated after initialization, a race condition is very unlikely). We can check whether the new get_thread_count() method needs this or not.
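A small sketch of the pattern described above, under the assumption that the thread count is fixed after construction (NaivePool is hypothetical, not caffe2::ThreadPool): size() reads a plain immutable member instead of taking the pool's mutex, so calling it from a worker thread while the pool holds its dispatch mutex can no longer deadlock.

```cpp
// Hypothetical pool; not the caffe2::ThreadPool implementation.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

class NaivePool {
 public:
  explicit NaivePool(int n) : num_threads_(n) {}

  // Lock-free: safe to call from any thread, including the pool's own workers.
  int size() const { return num_threads_; }

  // The mutex only protects dispatch, not size().
  template <typename F>
  void run(F&& task) {
    std::lock_guard<std::mutex> guard(dispatch_mutex_);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads_; ++i) {
      workers.emplace_back([&] { task(*this); });  // task may call size()
    }
    for (auto& w : workers) w.join();  // joining while holding dispatch_mutex_ is now safe
  }

 private:
  const int num_threads_;   // immutable after construction
  std::mutex dispatch_mutex_;
};

int main() {
  NaivePool pool(4);
  pool.run([](const NaivePool& p) { std::printf("pool size: %d\n", p.size()); });
  return 0;
}
```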

@AshkanAliabadi
Contributor Author

I think one thing I should do that will make it easier on the reviewers is to break this PR into several smaller ones. :) That way we can merge the core, non-controversial parts of the implementation that will not affect anyone, and then debate the best ways to enable it and expose it to the user.

@Maratyszcza
Contributor

While this PR seems to address PyTorch Mobile needs, XNNPACK is not limited to mobile platforms: it supports Web inference through WebAssembly and WebAssembly SIMD micro-kernels, and it includes a decent and quickly improving set of micro-kernels for server-class x86-64 processors, i.e. targeting AVX/AVX2/AVX512 features. I'd suggest not restricting its integration to just mobile platforms.

@@ -1,47 +0,0 @@
#include "caffe2/utils/threadpool/pthreadpool.h"
Contributor

I'm under the impression that this caffe2 pthreadpool glue code is still used in some FB production, since we haven't fully migrated to PyTorch yet - is that correct? In that case we cannot immediately delete these files...



}

const auto registry = c10::RegisterOperators()
.op("mobile::conv2d_create",
Contributor

Any chance we can move op registration to native_functions.yaml? We introduced manual op registration for quantized ops, but people were debating whether that is the right direction to go. I think the current suggestion is still to stick to native_functions.yaml & codegen? cc: @dzhulgakov @gchanan @smessmer

Contributor

native_functions.yaml doesn't support namespaces like mobile:: yet, but generally yes, it's usually better to define things in native_functions.yaml because you get the C++ frontend generated.



@AshkanAliabadi
Contributor Author

AshkanAliabadi commented Jan 23, 2020

Breaking this PR into smaller chunks to enable a more controlled roll-out, starting with #32509. Closing this PR.

facebook-github-bot pushed a commit that referenced this pull request Feb 24, 2020
Summary:
In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but it will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding **native** implementations, provided that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance XNNPACK enables, as it does not provide a way to compute one-time operations once and factor them out of the innermost forward() loop.

The better solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that call directly into these implementations, either through c10 or some other mechanism, also allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen and make the codebase a little simpler, provided that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.
Pull Request resolved: #32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
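As a rough illustration of the staging the summary above describes (all function names here are hypothetical, not the actual ATen symbols), the XNNPACK-backed path is compiled in only when the build defines USE_XNNPACK, and callers otherwise fall through to the existing implementation:

```cpp
// Sketch of a USE_XNNPACK-guarded dispatch; hypothetical names throughout.
#include <vector>

std::vector<float> reference_convolution(const std::vector<float>& input) {
  return input;  // stand-in for the existing (non-XNNPACK) kernel
}

#ifdef USE_XNNPACK
bool xnnpack_convolution_usable(const std::vector<float>& input) {
  return !input.empty();  // real code would check layout, dtype, params, etc.
}
std::vector<float> xnnpack_convolution(const std::vector<float>& input) {
  return input;  // stand-in for the NHWC / XNNPACK-backed kernel
}
#endif

std::vector<float> convolution(const std::vector<float>& input) {
#ifdef USE_XNNPACK
  if (xnnpack_convolution_usable(input)) {
    return xnnpack_convolution(input);
  }
#endif
  return reference_convolution(input);  // default path when not gated in
}
```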
facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2020
Summary:
Pull Request resolved: #33722

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but it will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, provided that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.

Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance XNNPACK enables, as it does not provide a way to compute one-time operations once and factor them out of the innermost forward() loop.

The better solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that call directly into these implementations, either through c10 or some other mechanism, also allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which could potentially allow us to remove NNPACK from ATen and make the codebase a little simpler, provided that there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: #32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
hczhu pushed a commit that referenced this pull request Feb 28, 2020
hczhu pushed a commit that referenced this pull request Feb 28, 2020
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020