Add AVX implementation for Global Average Pooling layer. #687
Conversation
@beru I think we should decide specific naming conventions for these vector blocks. Just for the sake of consistency. Right now, to avoid confusion, I preferred suffixing such variables with a
@karandesai-96 I say NAY to Systems Hungarian notation.
  const __m128 fourSum = _mm_hadd_ps(twoSum, twoSum);
  return fourSum;
}
Please check this page: http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86
It looks like the following is faster.
// in : ( x3, x2, x1, x0 )
// out : ( -, -, -, x3+x2+x1+x0 )
inline __m128 hsum128_ps(__m128 x)
{
  // loDual = ( -, -, x1, x0 )
  const __m128 loDual = x;
  // hiDual = ( -, -, x3, x2 )
  const __m128 hiDual = _mm_movehl_ps(x, x);
  // sumDual = ( -, -, x1+x3, x0+x2 )
  const __m128 sumDual = _mm_add_ps(loDual, hiDual);
  // lo = ( -, -, -, x0+x2 )
  const __m128 lo = sumDual;
  // hi = ( -, -, -, x1+x3 )
  const __m128 hi = _mm_shuffle_ps(sumDual, sumDual, 0x1);
  // sum = ( -, -, -, x0+x1+x2+x3 )
  const __m128 sum = _mm_add_ss(lo, hi);
  return sum;
}
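For readers without an SSE reference handy, the intrinsic sequence above amounts to a pairwise reduction over the four lanes. A portable scalar model (a hypothetical helper written for this example, not the PR's code):

```cpp
#include <cassert>

// Scalar model of hsum128_ps: treating the __m128 as float x[4],
// the movehl/add step forms (x0+x2, x1+x3) and the shuffle/add_ss
// step combines the two partial sums in the lowest lane.
inline float hsum4(const float x[4]) {
  const float sum02 = x[0] + x[2];  // sumDual, low lane
  const float sum13 = x[1] + x[3];  // sumDual, high lane
  return sum02 + sum13;             // final _mm_add_ss
}
```

Note the reduction order differs from a left-to-right sum, which is fine here since floating-point addition error is negligible for this use but worth remembering when comparing against a scalar reference bit-for-bit.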
It was used earlier but I don't need it anymore.
                              context.parallelize());
  } else {
    throw nn_error("Not supported engine: " + to_string(engine));
  }
I don't think every backend should provide its own version of every kernel, as that's most likely impossible.
So I would simply write the fallback routine this way:
if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
  // todo (kd): add avx implementation for CNN_USE_DOUBLE
  kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                           context.parallelize());
#else
  kernels::global_avepool_grad_op_avx(prev_delta, curr_delta, params,
                                      context.parallelize());
#endif
} else {
  kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                           context.parallelize());
}
                              context.parallelize());
  } else {
    throw nn_error("Not supported engine: " + to_string(engine));
  }
I would write the fallback routine this way:
if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
  // todo (kd): add avx implementation for CNN_USE_DOUBLE
  kernels::global_avepool_op_internal(in_data, out_data, params,
                                      context.parallelize());
#else
  kernels::global_avepool_op_avx(in_data, out_data, params,
                                 context.parallelize());
#endif
} else {
  kernels::global_avepool_op_internal(in_data, out_data, params,
                                      context.parallelize());
}
    backend_type == backend_t::nnpack) {
if (backend_type == core::backend_t::internal ||
    backend_type == core::backend_t::avx ||
    backend_type == core::backend_t::nnpack) {
What if other backends appear in the future?
Isn't it inconvenient that implementors have to update this code?
This is done in all the layers right now. I couldn't find a better alternative hence went for consistency.
My understanding is that we can always fall back to the internal backend when the selected backend doesn't support a layer's operation.
    sum_m = _mm256_add_ps(sum_m, in_m);
  }
  out[i] = _mm_cvtss_f32(hsum256_ps(sum_m));
  out[i] /= pool_area;
It is advised to use reciprocal multiplication instead of division.
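The suggestion boils down to paying for one division up front and replacing the per-channel divide with a multiply. A minimal sketch (the function name is made up for this example):

```cpp
#include <cassert>
#include <vector>

// Hoist the division out of the loop: one divide to form the
// reciprocal, then cheap multiplies per element.
void scale_by_pool_area(std::vector<float> &out, float pool_area) {
  const float pool_area_inv = 1.0f / pool_area;  // single division
  for (float &v : out) {
    v *= pool_area_inv;  // replaces v /= pool_area
  }
}
```

Division has a much higher latency than multiplication on typical x86 cores, so this matters in hot loops; the results are bit-identical only when the reciprocal is exact (e.g. pool_area is a power of two).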
  -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
};
__m256i imask = _mm256_loadu_si256(
  (__m256i const *)(mask_src + 8 - nremains_per_channel));
You should create this mask variable outside of the loop.
size_t j = 0;

while (j < nblocks_per_channel) {
  __m256 prev0 = _mm256_set1_ps(pi);
What's the point of creating this variable inside the inner loop when its value is always the same?
const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
  const float_t pi = curr[i] / pool_area;
It is advised to replace division with reciprocal multiplication.
namespace tiny_dnn {
namespace kernels {

#ifdef CNN_USE_AVX
The global_avepool_op_avx and global_avepool_grad_op_avx functions are referenced even when CNN_USE_AVX isn't defined, so if you reference them, you should keep them.
In the tiny_dnn/core/kernels/conv2d_op_avx.h file, I separated the interface function conv2d_op_avx from the implementation function avx_conv2d_5x5_kernel.
I rectified this one, similar to what is done in the conv2d and fully connected kernels.
                                 const core::global_avepool_params &params,
                                 const bool layer_parallelize) {
  const size_t pool_area = params.in.width_ * params.in.height_;
  const size_t pool_area_inv = 1.0f / pool_area;
It's impossible for a size_t variable to hold a floating-point value.
You also need to write test code for new implementation.
  CNN_UNREFERENCED_PARAMETER(out_data);
  CNN_UNREFERENCED_PARAMETER(params);
  CNN_UNREFERENCED_PARAMETER(layer_parallelize);
  throw nn_error("TinyDNN has not been compiled with AVX support.");
I'd write a fallback call to the internal backend routine global_avepool_op_internal instead of throwing an exception.
IIRC that's what's done now in all layers (an exception). Probably falling back to the internal backend with a warning is a better solution.
Hmmm, @Randl and @karandesai-96 are right, I didn't check the other layers' implementations.
Giving a warning at compile time with #pragma message?
Yeah, that's what I thought too
  CNN_UNREFERENCED_PARAMETER(curr_delta);
  CNN_UNREFERENCED_PARAMETER(params);
  CNN_UNREFERENCED_PARAMETER(layer_parallelize);
  throw nn_error("TinyDNN has not been compiled with AVX support.");
I'd write a fallback call to the internal backend routine global_avepool_grad_op_internal instead of throwing an exception.
That way, you can remove the backend type check in the global_average_pooling_layer::init_backend method in the tiny_dnn/layers/global_average_pooling_layer.h file.
                                 const core::global_avepool_params &params,
                                 const bool layer_parallelize) {
  const size_t pool_area = params.in.width_ * params.in.height_;
  const float_t pool_area_inv = 1.0f / pool_area;
You don't need to use float_t here; just float is OK. And don't forget to cast pool_area to float with static_cast, otherwise I think the compiler gives a warning.
I'd use _mm_set_ss to hold the pool_area value in an __m128 variable and use _mm_rcp_ss to compute a reciprocal. That way, you can multiply it with the result of hsum256_ps using _mm_mul_ss.
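A minimal sketch of that suggestion (an illustrative helper for this discussion, not the PR's code; requires an x86 target since it uses SSE intrinsics):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>

// Scale a horizontal-sum result by 1/pool_area using the approximate
// reciprocal instruction. _mm_rcp_ss is only accurate to about 12
// bits of precision; use a real division (or add a Newton-Raphson
// refinement step) if full float precision is required.
inline float scale_by_rcp(float sum, float pool_area) {
  const __m128 area = _mm_set_ss(pool_area);
  const __m128 inv  = _mm_rcp_ss(area);        // ~ 1.0f / pool_area
  const __m128 s    = _mm_set_ss(sum);
  return _mm_cvtss_f32(_mm_mul_ss(s, inv));    // sum * (1/pool_area)
}
```

Because the reciprocal is approximate, tests comparing against the exact quotient need a tolerance rather than exact equality.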
You can use the _mm_cvtsi32_ss intrinsic function to set a single dword (32-bit) value into the first slot of an __m128.
                              context.parallelize());
  } else {
    // fallback to internal implementation as nnpack implementation
    // is not available
The above comment is outdated.
__m256i imask =
  _mm256_loadu_si256((__m256i const *)(mask_src + 8 - nremains_per_channel));

for_i(layer_parallelize, in_data.size(), [&](int sample) {
Please use size_t sample instead of int sample for the functor lambda's argument with the for_i template function.
#679 is related.
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256 in_m = _mm256_load_ps(&in[depth_index + 8 * j]);
    sum_m = _mm256_add_ps(sum_m, in_m);
You should unroll this loop 8x.
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
    sum_m = _mm256_add_pd(sum_m, in_m);
You should unroll this loop 8x.
I didn't understand this. What do you mean by unrolling 8x? Aren't we doing vertical additions till the end, followed by a horizontal addition?
It's related to instruction latencies and throughputs, instruction-level parallelism, the 16 YMM registers, etc.
The Haswell and Skylake microarchitectures have 8 execution ports, etc., etc.
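The point of the unrolling advice is that a single accumulator serializes every add behind the previous one (each add must wait its full latency), while several independent accumulators give the CPU separate dependency chains to execute in parallel. A portable scalar sketch of the technique with four accumulators (the review suggests eight for the AVX code, since there are 16 YMM registers to spare; names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Sum with 4 independent accumulators: s0..s3 form separate
// dependency chains, so their adds can overlap in the pipeline.
float sum_unrolled(const float *in, std::size_t n) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += in[i];      // chain 0
    s1 += in[i + 1];  // chain 1
    s2 += in[i + 2];  // chain 2
    s3 += in[i + 3];  // chain 3
  }
  for (; i < n; ++i) s0 += in[i];  // scalar remainder
  return (s0 + s1) + (s2 + s3);    // combine partial sums at the end
}
```

The same shape applies to the AVX version: keep 8 __m256 accumulators live across the inner loop, then merge them once before the horizontal sum.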
const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
  const double pi = curr[i] * pool_area_inv;
You should rewrite this line with AVX intrinsic functions.
  __m256d sum_m = _mm256_setzero_pd();
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
Are you sure the address &in[depth_index + 4 * j] is always aligned to 32 bytes?
If not, you should use _mm256_loadu_pd.
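One way to sanity-check that assumption during development (an illustrative helper, not project code) is an explicit alignment predicate:

```cpp
#include <cassert>
#include <cstdint>

// _mm256_load_pd requires its address to be 32-byte aligned;
// _mm256_loadu_pd has no such requirement. This checks the
// low 5 address bits, which must all be zero for 32-byte alignment.
inline bool is_aligned32(const void *p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 31u) == 0;
}
```

Note that even if the buffer's base is 32-byte aligned, an offset like depth_index = i * pool_area only preserves that alignment when the offset itself is a multiple of 4 doubles, so the unaligned load is the safe default here.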
Force-pushed from afba6a5 to 962b229.
@beru I have addressed the review comments. Please have another look.
@karandesai-96 could you rename 494e1bb to say something like "add google benchmark framework"?
#if defined(_MSC_VER)
#define CNN_MUST_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUST_INLINE inline
#endif
Force-pushed from b8440f2 to befed45.
@karandesai-96 You need to replace
@beru Be right back, making a coffee with stronger caffeine content xD
Force-pushed from 3d61f96 to 8466482.
LGTM, a couple of small fixes
CMakeLists.txt (outdated)
option(BUILD_TESTS "Set to ON to build tests" OFF)
option(BUILD_EXAMPLES "Set to ON to build examples" OFF)
option(BUILD_DOCS "Set to ON to build documentation" OFF)
option(BUILD_BENCHMARKS "Set to ON to build documentation" OFF)
s/documentation/benchmark/
@@ -0,0 +1,50 @@
#pragma once
The license comment is missing.
@@ -3,6 +3,7 @@

file(GLOB_RECURSE ALL_CXX_SOURCE_FILES
  ${CMAKE_SOURCE_DIR}/tiny_dnn/*.h
  ${CMAKE_SOURCE_DIR}/benchmarks/*.h
*.cpp too.
const core::backend_t engine = context.engine();

if (engine == core::backend_t::avx) {
#ifdef CNN_USE_AVX
This would mean that if engine is set to core::backend_t::avx and CNN_USE_AVX is undefined, nothing happens? This is kind of unintuitive and also inconsistent with other layers, where we fall back to internal in this case.
@Randl I think killing the core::backend_t::avx entry and all the AVX-related code when CNN_USE_AVX isn't defined would be a simpler solution.
I mean disabling the code with preprocessor directives.
Vote for it
👍
@edgarriba You need to remember that the voting system is powerless before you.
tiny_dnn/util/macro.h (outdated)
#define CNN_MUSTINLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
#endif
@karandesai-96 You_should_start_liking_to_insert_an_underbar_between_words.
@beru CNN_SCREAMING_SNAKE_CASE_FOR_THE_WIN
Good to merge?
@edgarriba will fix the typo and license comment once I finish travelling.
Force-pushed from 8466482 to af7cd45.
tiny_dnn/util/macro.h (outdated)
@@ -12,3 +12,11 @@
#if defined _WIN32 && !defined(__MINGW32__)
#define CNN_WINDOWS
#endif

#if defined(_MSC_VER)
#define CNN_MUSTINLINE __forceinline
There's something missing between T and I. (Hint: 0x5F)
How did I miss this 😕 thanks for pointing out
tiny_dnn/util/macro.h (outdated)
#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
Feel free to perform statement coverage test for this path.
Is that what makes AppVeyor fail?
This one, #687 (comment), is the cause.
Force-pushed from af7cd45 to ebdd564.
Since the global average pooling layer calculates the average of all activations per channel, we pick up contiguous blocks of 8 floats and keep performing vertical sums channelwise. At the end, the net sum is accumulated by a horizontal sum. This is repeated for all channels of a layer.
The current code falls back to the internal backend if nnpack or another unsupported backend is chosen.
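The forward pass described above can be sketched as a scalar reference, which the AVX path merely vectorizes 8 floats at a time (illustrative code; function and parameter names are made up, not the PR's actual signatures):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for global average pooling: one output per
// channel, equal to the mean of that channel's `area` activations.
std::vector<float> global_avgpool(const std::vector<float> &in,
                                  std::size_t depth, std::size_t area) {
  std::vector<float> out(depth, 0.0f);
  const float inv = 1.0f / static_cast<float>(area);  // hoisted reciprocal
  for (std::size_t c = 0; c < depth; ++c) {
    float sum = 0.0f;  // the AVX version accumulates 8 floats per step
    for (std::size_t j = 0; j < area; ++j) {
      sum += in[c * area + j];  // "vertical" accumulation
    }
    out[c] = sum * inv;  // reciprocal multiply instead of division
  }
  return out;
}
```

A reference like this is also handy as the oracle when writing tests for the AVX kernels, as requested in the review.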