
Add AVX implementation for Global Average Pooling layer. #687

Merged
merged 8 commits into from Jun 16, 2017

Conversation

Contributor

@kdexd kdexd commented May 6, 2017

Since the global average pooling layer computes the average of all activations per channel, we pick up contiguous blocks of 8 floats and keep performing vertical sums channel-wise. At the end, a net sum is accumulated by a horizontal sum. This is repeated for every channel of the layer.

The current code falls back to the internal backend if nnpack or another unsupported backend is chosen.
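For reference, a minimal scalar sketch of what the kernel computes (illustrative names and signature, not tiny_dnn's actual API):

```cpp
#include <cstddef>
#include <vector>

// Global average pooling: one output value per channel, equal to the mean
// of that channel's pool_area activations. The AVX kernel vectorizes the
// inner loop 8 floats at a time.
std::vector<float> global_avgpool(const std::vector<float> &in,
                                  std::size_t depth, std::size_t pool_area) {
  std::vector<float> out(depth, 0.0f);
  for (std::size_t i = 0; i < depth; ++i) {
    for (std::size_t j = 0; j < pool_area; ++j) {
      out[i] += in[i * pool_area + j];  // "vertical" accumulation
    }
    out[i] /= static_cast<float>(pool_area);
  }
  return out;
}
```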

@kdexd kdexd changed the title Add AVX implementation for Global Average pooling layer. Add AVX implementation for Global Average Pooling layer. May 6, 2017
Contributor Author

kdexd commented May 6, 2017

@beru I think we should decide on specific naming conventions for these vector blocks, just for the sake of consistency. Right now, to avoid confusion, I preferred suffixing such variables with _m.

Contributor

beru commented May 6, 2017

@karandesai-96 I say NAY to Systems Hungarian notation.

const __m128 fourSum = _mm_hadd_ps(twoSum, twoSum);
return fourSum;
}

Contributor

Please check this page. http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86

It looks like this is faster.

// in  : ( x3, x2, x1, x0 )
// out : (  -,  -,  -, x3+x2+x1+x0 )
inline __m128 hsum128_ps(__m128 x)
{
  // loDual = ( -, -, x1, x0 )
  const __m128 loDual = x;
  // hiDual = ( -, -, x3, x2 )
  const __m128 hiDual = _mm_movehl_ps(x, x);
  // sumDual = ( -, -, x1+x3, x0+x2 )
  const __m128 sumDual = _mm_add_ps(loDual, hiDual);
  // lo = ( -, -, -, x0+x2 )
  const __m128 lo = sumDual;
  // hi = ( -, -, -, x1+x3 )
  const __m128 hi = _mm_shuffle_ps(sumDual, sumDual, 0x1);
  // sum = ( -, -, -, x0+x1+x2+x3 )
  const __m128 sum = _mm_add_ss(lo, hi);
  return sum;
}
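The same halving idea written in scalar form may help readers who don't know the intrinsics: each round adds the upper half of the values onto the lower half, so 8 values need only 3 rounds of additions (hsum8 is a hypothetical helper, not part of the PR):

```cpp
#include <array>
#include <cstddef>

// Pairwise (tree) reduction: 8 -> 4 -> 2 -> 1 partial sums, mirroring the
// movehl/shuffle + add sequence of the SSE version above.
float hsum8(std::array<float, 8> x) {
  for (std::size_t half = 4; half >= 1; half /= 2) {
    for (std::size_t i = 0; i < half; ++i) {
      x[i] += x[i + half];
    }
  }
  return x[0];
}
```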

Contributor Author

It was used earlier but I don't need it anymore.

context.parallelize());
} else {
throw nn_error("Not supported engine: " + to_string(engine));
}
Contributor

I don't think every backend should provide its own version of every kernel, as that's highly likely impossible.
So I would simply write the fallback routine this way.

    if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
      // todo (kd): add avx implementation for CNN_USE_DOUBLE
      kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                               context.parallelize());
#else
      kernels::global_avepool_grad_op_avx(prev_delta, curr_delta, params,
                                          context.parallelize());
#endif
    } else {
      kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                               context.parallelize());
    }

context.parallelize());
} else {
throw nn_error("Not supported engine: " + to_string(engine));
}
Contributor

I would write the fallback routine this way.

    if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
      // todo (kd): add avx implementation for CNN_USE_DOUBLE
      kernels::global_avepool_op_internal(in_data, out_data, params,
                                          context.parallelize());
#else
      kernels::global_avepool_op_avx(in_data, out_data, params,
                                     context.parallelize());
#endif
    } else {
      kernels::global_avepool_op_internal(in_data, out_data, params,
                                          context.parallelize());
    }

backend_type == backend_t::nnpack) {
if (backend_type == core::backend_t::internal ||
backend_type == core::backend_t::avx ||
backend_type == core::backend_t::nnpack) {
Contributor

What if other backends appear in the future?
Isn't it inconvenient that implementers have to update this code?

Contributor Author

This is done in all the layers right now. I couldn't find a better alternative, hence went for consistency.

Contributor

My understanding is that we can always fall back to the internal backend when the selected backend doesn't support a layer's operation.

sum_m = _mm256_add_ps(sum_m, in_m);
}
out[i] = _mm_cvtss_f32(hsum256_ps(sum_m));
out[i] /= pool_area;
Contributor

It is advised to use reciprocal multiplication instead of division.
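The suggestion in scalar form: perform one division outside the loop and multiply inside it, since a float multiply is considerably cheaper than a divide (names here are illustrative, not the PR's):

```cpp
#include <cstddef>

// Divide once, then multiply per element - instead of dividing per element.
void scale_by_pool_area(float *out, std::size_t n, std::size_t pool_area) {
  const float inv = 1.0f / static_cast<float>(pool_area);  // one division
  for (std::size_t i = 0; i < n; ++i) {
    out[i] *= inv;  // cheap multiplies inside the hot loop
  }
}
```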

-1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
};
__m256i imask = _mm256_loadu_si256(
(__m256i const *)(mask_src + 8 - nremains_per_channel));
Contributor

You should hoist this mask variable out of the loop.

size_t j = 0;

while (j < nblocks_per_channel) {
__m256 prev0 = _mm256_set1_ps(pi);
Contributor

What's the point of creating this variable inside the inner loop when the value is always the same?

const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
const float_t pi = curr[i] / pool_area;
Contributor

It is advised to replace division with reciprocal multiplication.

namespace tiny_dnn {
namespace kernels {

#ifdef CNN_USE_AVX
Contributor

The global_avepool_op_avx and global_avepool_grad_op_avx functions are referenced even when CNN_USE_AVX isn't defined, so as long as you reference them, you need to keep them defined.

In the tiny_dnn/core/kernels/conv2d_op_avx.h file, I separated the interface function conv2d_op_avx from the implementation function avx_conv2d_5x5_kernel.

Contributor Author

I rectified this one - similar to what is done in the conv2d and fully connected kernels.

const core::global_avepool_params &params,
const bool layer_parallelize) {
const size_t pool_area = params.in.width_ * params.in.height_;
const size_t pool_area_inv = 1.0f / pool_area;
Contributor

It's impossible for a size_t variable to hold a floating-point value.

Contributor

@beru beru left a comment

You also need to write test code for the new implementation.

CNN_UNREFERENCED_PARAMETER(out_data);
CNN_UNREFERENCED_PARAMETER(params);
CNN_UNREFERENCED_PARAMETER(layer_parallelize);
throw nn_error("TinyDNN has not been compiled with AVX support.");
Contributor

I'd write a fallback call to the internal backend routine global_avepool_op_internal instead of throwing an exception.

Contributor

IIRC that's what's done now in all layers (throwing an exception). Probably falling back to the internal backend with a warning is a better solution.

Contributor

Hmmm, @Randl and @karandesai-96 are right - I didn't check the other layers' implementations.
Give a warning at compile time with #pragma message?

Contributor

Yeah, that's what I thought too

CNN_UNREFERENCED_PARAMETER(curr_delta);
CNN_UNREFERENCED_PARAMETER(params);
CNN_UNREFERENCED_PARAMETER(layer_parallelize);
throw nn_error("TinyDNN has not been compiled with AVX support.");
Contributor

I'd write a fallback call to the internal backend routine global_avepool_grad_op_internal instead of throwing an exception.

That way, you can remove the backend type check in the global_average_pooling_layer::init_backend method in the tiny_dnn/layers/global_average_pooling_layer.h file.

const core::global_avepool_params &params,
const bool layer_parallelize) {
const size_t pool_area = params.in.width_ * params.in.height_;
const float_t pool_area_inv = 1.0f / pool_area;
Contributor

You don't need to use float_t here; plain float is OK. And don't forget to cast pool_area to float with static_cast, otherwise I think the compiler gives a warning.

I'd use _mm_set_ss to hold the pool_area value in a __m128 variable and _mm_rcp_ss to compute a reciprocal. That way, you can multiply it with the result of hsum256_ps using _mm_mul_ss.

Contributor

You can use the _mm_cvtsi32_ss intrinsic to set a single dword (32-bit) integer into the first slot of a __m128.

context.parallelize());
} else {
// fallback to internal implementation as nnpack implementation
// is not available
Contributor

The above comment is outdated.

__m256i imask =
_mm256_loadu_si256((__m256i const *)(mask_src + 8 - nremains_per_channel));

for_i(layer_parallelize, in_data.size(), [&](int sample) {
Contributor

@beru beru May 7, 2017

Please use size_t sample instead of int sample for the functor lambda's argument with the for_i template function.
#679 is related.

const size_t depth_index = i * pool_area;
for (size_t j = 0; j < nblocks_per_channel; j++) {
__m256 in_m = _mm256_load_ps(&in[depth_index + 8 * j]);
sum_m = _mm256_add_ps(sum_m, in_m);
Contributor

You should unroll this loop 8x times.

const size_t depth_index = i * pool_area;
for (size_t j = 0; j < nblocks_per_channel; j++) {
__m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
sum_m = _mm256_add_pd(sum_m, in_m);
Contributor

You should unroll this loop 8x times.

Contributor Author

I didn't understand this - what do you mean by unrolling 8x? Aren't we doing vertical additions till the end, followed by a horizontal addition?

Contributor

It's related to instruction latencies and throughputs, instruction-level parallelism, the 16 YMM registers, etc.
The Haswell and Skylake microarchitectures each have 8 execution ports, etc., etc.
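In other words, a single sum_m forms one serial dependency chain - each add must wait for the previous one to finish - while several independent accumulators let the adds overlap in the pipeline. A scalar sketch of the idea (4-way here for brevity; beru suggests 8 partial sums for the AVX loop):

```cpp
#include <cstddef>
#include <vector>

// Sum with independent partial accumulators so consecutive additions are
// not serialized on one register; combine the partials at the end.
float unrolled_sum(const std::vector<float> &v) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= v.size(); i += 4) {
    s0 += v[i];
    s1 += v[i + 1];
    s2 += v[i + 2];
    s3 += v[i + 3];
  }
  for (; i < v.size(); ++i) s0 += v[i];  // scalar remainder
  return (s0 + s1) + (s2 + s3);
}
```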

const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
const double pi = curr[i] * pool_area_inv;
Contributor

@beru beru May 7, 2017

You should rewrite this line with AVX intrinsics.

__m256d sum_m = _mm256_setzero_pd();
const size_t depth_index = i * pool_area;
for (size_t j = 0; j < nblocks_per_channel; j++) {
__m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
Contributor

Are you sure the address &in[depth_index + 4 * j] is always aligned to 32 bytes?
If not, you should use _mm256_loadu_pd.
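A small predicate makes the alignment assumption explicit and checkable in debug builds (illustrative only; the safe default is simply to use _mm256_loadu_pd):

```cpp
#include <cstdint>

// _mm256_load_pd requires a 32-byte-aligned address; _mm256_loadu_pd does not.
inline bool is_aligned_32(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

alignas(32) double demo[8] = {};  // example buffer with guaranteed alignment
```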

@kdexd kdexd force-pushed the global-avepool-avx branch 3 times, most recently from afba6a5 to 962b229 Compare May 10, 2017 08:55
Contributor Author

kdexd commented May 10, 2017

@beru I have worked through the review comments. Please have a look once more.
@edgarriba some patterns differ from the other layers - such as the fallback, macro protection, etc. If this works well, we can make it uniform throughout.

Member

@karandesai-96 could you rename 494e1bb to say something like "add google benchmark framework"?

Contributor

@beru beru left a comment

#if defined(_MSC_VER)
#define CNN_MUST_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUST_INLINE inline
#endif

@kdexd kdexd force-pushed the global-avepool-avx branch 2 times, most recently from b8440f2 to befed45 Compare June 5, 2017 04:49
Contributor

beru commented Jun 5, 2017

@karandesai-96 You need to replace CNN_MUSTINLINE with CNN_MUST_INLINE in the tiny_dnn/util/product.h file. How did you not notice the little underbar trap I added...

Contributor Author

kdexd commented Jun 6, 2017

@beru Be right back - making a coffee with stronger caffeine content xD

@kdexd kdexd force-pushed the global-avepool-avx branch 2 times, most recently from 3d61f96 to 8466482 Compare June 8, 2017 04:28
Contributor

@Randl Randl left a comment

LGTM, a couple of small fixes

CMakeLists.txt Outdated
option(BUILD_TESTS "Set to ON to build tests" OFF)
option(BUILD_EXAMPLES "Set to ON to build examples" OFF)
option(BUILD_DOCS "Set to ON to build documentation" OFF)
option(BUILD_BENCHMARKS "Set to ON to build documentation" OFF)
Contributor

s/documentation/benchmark/.

@@ -0,0 +1,50 @@
#pragma once
Contributor

License comment is missing.

@@ -3,6 +3,7 @@

file(GLOB_RECURSE ALL_CXX_SOURCE_FILES
${CMAKE_SOURCE_DIR}/tiny_dnn/*.h
${CMAKE_SOURCE_DIR}/benchmarks/*.h
Contributor

*.cpp too

const core::backend_t engine = context.engine();

if (engine == core::backend_t::avx) {
#ifdef CNN_USE_AVX
Contributor

This would mean that if the engine is set to core::backend_t::avx and CNN_USE_AVX is undefined, nothing happens? That is kind of unintuitive, and also inconsistent with other layers, where we fall back to internal in this case.

Contributor

@Randl I think removing the core::backend_t::avx entry and all the AVX-related code when CNN_USE_AVX isn't defined would be a simpler solution.

Contributor

I mean disabling the code with preprocessor directives.
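A simplified sketch of that idea - gating the avx enumerator itself behind the macro, so a build without AVX support can't even name the backend (hypothetical shape; the real backend_t has more entries):

```cpp
#include <string>

enum class backend_t {
  internal,
#ifdef CNN_USE_AVX
  avx,  // this enumerator only exists in AVX-enabled builds
#endif
  nnpack,
};

// Compiles in every configuration because the avx case is gated by the
// same macro as the enumerator itself.
inline std::string backend_name(backend_t b) {
  switch (b) {
    case backend_t::internal: return "internal";
#ifdef CNN_USE_AVX
    case backend_t::avx: return "avx";
#endif
    case backend_t::nnpack: return "nnpack";
  }
  return "unknown";
}
```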

Member

Vote for it

Contributor

👍

Contributor

@edgarriba You need to remember that the voting system is powerless before you.

#define CNN_MUSTINLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
#endif
Contributor

@karandesai-96 You_should_start_liking_to_insert_an_underbar_between_words.

Contributor Author

@beru CNN_SCREAMING_SNAKE_CASE_FOR_THE_WIN

Member

Good to merge?

Contributor Author

kdexd commented Jun 10, 2017

@edgarriba I'll fix the typo and the license comment once I finish travelling.

@beru beru mentioned this pull request Jun 10, 2017
@beru beru mentioned this pull request Jun 11, 2017
@@ -12,3 +12,11 @@
#if defined _WIN32 && !defined(__MINGW32__)
#define CNN_WINDOWS
#endif

#if defined(_MSC_VER)
#define CNN_MUSTINLINE __forceinline
Contributor

There's something missing between T and I. (Hint: 0x5F)

Contributor Author

How did I miss this 😕 thanks for pointing out

#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
Contributor

Feel free to perform statement coverage test for this path.

Member

Is that what makes AppVeyor fail?

Contributor

This one #687 (comment)
is the cause.

@beru beru merged commit 494c9fc into tiny-dnn:master Jun 16, 2017
Member

cool! @beru @karandesai-96 see #766

@kdexd kdexd deleted the global-avepool-avx branch July 18, 2017 17:39