Add AVX implementation for Global Average Pooling layer. #687
Conversation
@beru I think we should decide specific naming conventions for these vector blocks. Just for the sake of consistency. Right now, to avoid confusion, I preferred suffixing such variables with a
@karandesai-96 I say NAY to Systems Hungarian notation.
  const __m128 fourSum = _mm_hadd_ps(twoSum, twoSum);
  return fourSum;
}
Please check this page: http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86
It looks like the following is faster.
// in : ( x3, x2, x1, x0 )
// out : ( -, -, -, x3+x2+x1+x0 )
inline __m128 hsum128_ps(__m128 x)
{
  // loDual = ( -, -, x1, x0 )
  const __m128 loDual = x;
  // hiDual = ( -, -, x3, x2 )
  const __m128 hiDual = _mm_movehl_ps(x, x);
  // sumDual = ( -, -, x1+x3, x0+x2 )
  const __m128 sumDual = _mm_add_ps(loDual, hiDual);
  // lo = ( -, -, -, x0+x2 )
  const __m128 lo = sumDual;
  // hi = ( -, -, -, x1+x3 )
  const __m128 hi = _mm_shuffle_ps(sumDual, sumDual, 0x1);
  // sum = ( -, -, -, x0+x1+x2+x3 )
  const __m128 sum = _mm_add_ss(lo, hi);
  return sum;
}
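For readers without an SSE reference handy, the intrinsic sequence above amounts to a pairwise reduction over the four lanes. A portable scalar model (a hypothetical helper written for this example, not the PR's code):

```cpp
#include <cassert>

// Scalar model of hsum128_ps: treating the __m128 as float x[4],
// the movehl/add step forms (x0+x2, x1+x3) and the shuffle/add_ss
// step combines the two partial sums in the lowest lane.
inline float hsum4(const float x[4]) {
  const float sum02 = x[0] + x[2];  // sumDual, low lane
  const float sum13 = x[1] + x[3];  // sumDual, high lane
  return sum02 + sum13;             // final _mm_add_ss
}
```

Note the reduction order differs from a left-to-right sum, which is fine here since floating-point addition error is negligible for this use but worth remembering when comparing against a scalar reference bit-for-bit.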
It was used earlier but I don't need it anymore.
                              context.parallelize());
  } else {
    throw nn_error("Not supported engine: " + to_string(engine));
  }
I don't think every backend should provide its own version of every kernel, as that's most likely impossible.
So I would simply write the fallback routine this way:
if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
  // todo (kd): add avx implementation for CNN_USE_DOUBLE
  kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                           context.parallelize());
#else
  kernels::global_avepool_grad_op_avx(prev_delta, curr_delta, params,
                                      context.parallelize());
#endif
} else {
  kernels::global_avepool_grad_op_internal(prev_delta, curr_delta, params,
                                           context.parallelize());
}
                              context.parallelize());
  } else {
    throw nn_error("Not supported engine: " + to_string(engine));
  }
I would write the fallback routine this way:
if (engine == core::backend_t::avx) {
#ifdef CNN_USE_DOUBLE
  // todo (kd): add avx implementation for CNN_USE_DOUBLE
  kernels::global_avepool_op_internal(in_data, out_data, params,
                                      context.parallelize());
#else
  kernels::global_avepool_op_avx(in_data, out_data, params,
                                 context.parallelize());
#endif
} else {
  kernels::global_avepool_op_internal(in_data, out_data, params,
                                      context.parallelize());
}
    backend_type == backend_t::nnpack) {
if (backend_type == core::backend_t::internal ||
    backend_type == core::backend_t::avx ||
    backend_type == core::backend_t::nnpack) {
What if other backends appear in the future?
Isn't it inconvenient that implementors have to update this code?
This is done in all the layers right now. I couldn't find a better alternative hence went for consistency.
My understanding is that we can always fall back to the internal backend when the selected backend doesn't support a layer's operation.
    sum_m = _mm256_add_ps(sum_m, in_m);
  }
  out[i] = _mm_cvtss_f32(hsum256_ps(sum_m));
  out[i] /= pool_area;
It is advised to use reciprocal multiplication instead of division.
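The suggestion boils down to paying for one division up front and replacing the per-channel divide with a multiply. A minimal sketch (the function name is made up for this example):

```cpp
#include <cassert>
#include <vector>

// Hoist the division out of the loop: one divide to form the
// reciprocal, then cheap multiplies per element.
void scale_by_pool_area(std::vector<float> &out, float pool_area) {
  const float pool_area_inv = 1.0f / pool_area;  // single division
  for (float &v : out) {
    v *= pool_area_inv;  // replaces v /= pool_area
  }
}
```

Division has a much higher latency than multiplication on typical x86 cores, so this matters in hot loops; the results are bit-identical only when the reciprocal is exact (e.g. pool_area is a power of two).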
  -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0,
};
__m256i imask = _mm256_loadu_si256(
  (__m256i const *)(mask_src + 8 - nremains_per_channel));
You should create this mask variable outside of the loop.
size_t j = 0;

while (j < nblocks_per_channel) {
  __m256 prev0 = _mm256_set1_ps(pi);
What's the point of creating this variable inside the inner loop when its value is always the same?
const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
  const float_t pi = curr[i] / pool_area;
It is advised to replace division with reciprocal multiplication.
namespace tiny_dnn {
namespace kernels {

#ifdef CNN_USE_AVX
The global_avepool_op_avx and global_avepool_grad_op_avx functions are referenced even when CNN_USE_AVX isn't defined, so if you reference them, you should keep them.
In the tiny_dnn/core/kernels/conv2d_op_avx.h file, I separated the interface function conv2d_op_avx from the implementation function avx_conv2d_5x5_kernel.
I rectified this one, similar to what is done in the conv2d and fully connected kernels.
                                 const core::global_avepool_params &params,
                                 const bool layer_parallelize) {
  const size_t pool_area = params.in.width_ * params.in.height_;
  const size_t pool_area_inv = 1.0f / pool_area;
It's impossible for a size_t variable to hold a floating-point value.
You also need to write test code for new implementation.
  CNN_UNREFERENCED_PARAMETER(out_data);
  CNN_UNREFERENCED_PARAMETER(params);
  CNN_UNREFERENCED_PARAMETER(layer_parallelize);
  throw nn_error("TinyDNN has not been compiled with AVX support.");
I'd write a fallback call to the internal backend routine global_avepool_op_internal instead of throwing an exception.
IIRC that's what's done now in all layers (an exception). Probably falling back to the internal backend with a warning is a better solution.
Hmmm, @Randl and @karandesai-96 are right, I didn't check the other layers' implementations.
Giving a warning at compile time with #pragma message?
Yeah, that's what I thought too
  CNN_UNREFERENCED_PARAMETER(curr_delta);
  CNN_UNREFERENCED_PARAMETER(params);
  CNN_UNREFERENCED_PARAMETER(layer_parallelize);
  throw nn_error("TinyDNN has not been compiled with AVX support.");
I'd write a fallback call to the internal backend routine global_avepool_grad_op_internal instead of throwing an exception.
That way, you can remove the backend type check in the global_average_pooling_layer::init_backend method in the tiny_dnn/layers/global_average_pooling_layer.h file.
                                 const core::global_avepool_params &params,
                                 const bool layer_parallelize) {
  const size_t pool_area = params.in.width_ * params.in.height_;
  const float_t pool_area_inv = 1.0f / pool_area;
You don't need to use float_t here; just float is OK. And don't forget to cast pool_area to float with static_cast, otherwise I think the compiler gives a warning.
I'd use _mm_set_ss to hold the pool_area value in an __m128 variable and use _mm_rcp_ss to compute a reciprocal. That way, you can multiply it with the result of hsum256_ps using _mm_mul_ss.
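A minimal sketch of that suggestion (an illustrative helper for this discussion, not the PR's code; requires an x86 target since it uses SSE intrinsics):

```cpp
#include <immintrin.h>
#include <cassert>
#include <cmath>

// Scale a horizontal-sum result by 1/pool_area using the approximate
// reciprocal instruction. _mm_rcp_ss is only accurate to about 12
// bits of precision; use a real division (or add a Newton-Raphson
// refinement step) if full float precision is required.
inline float scale_by_rcp(float sum, float pool_area) {
  const __m128 area = _mm_set_ss(pool_area);
  const __m128 inv  = _mm_rcp_ss(area);        // ~ 1.0f / pool_area
  const __m128 s    = _mm_set_ss(sum);
  return _mm_cvtss_f32(_mm_mul_ss(s, inv));    // sum * (1/pool_area)
}
```

Because the reciprocal is approximate, tests comparing against the exact quotient need a tolerance rather than exact equality.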
You can use the _mm_cvtsi32_ss intrinsic function to set a single dword (32-bit) value into the first slot of an __m128.
                              context.parallelize());
  } else {
    // fallback to internal implementation as nnpack implementation
    // is not available
The above comment is outdated.
__m256i imask =
  _mm256_loadu_si256((__m256i const *)(mask_src + 8 - nremains_per_channel));

for_i(layer_parallelize, in_data.size(), [&](int sample) {
Please use size_t sample instead of int sample for the functor lambda's argument with the for_i template function.
#679 is related.
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256 in_m = _mm256_load_ps(&in[depth_index + 8 * j]);
    sum_m = _mm256_add_ps(sum_m, in_m);
You should unroll this loop 8x.
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
    sum_m = _mm256_add_pd(sum_m, in_m);
You should unroll this loop 8x.
I didn't understand this. What do you mean by unrolling 8x? Aren't we doing vertical additions till the end, followed by a horizontal addition?
It's related to instruction latencies and throughputs, instruction-level parallelism, the 16 YMM registers, etc.
The Haswell and Skylake microarchitectures have 8 execution ports, etc., etc.
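The point of the unrolling advice is that a single accumulator serializes every add behind the previous one (each add must wait its full latency), while several independent accumulators give the CPU separate dependency chains to execute in parallel. A portable scalar sketch of the technique with four accumulators (the review suggests eight for the AVX code, since there are 16 YMM registers to spare; names are illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Sum with 4 independent accumulators: s0..s3 form separate
// dependency chains, so their adds can overlap in the pipeline.
float sum_unrolled(const float *in, std::size_t n) {
  float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += in[i];      // chain 0
    s1 += in[i + 1];  // chain 1
    s2 += in[i + 2];  // chain 2
    s3 += in[i + 3];  // chain 3
  }
  for (; i < n; ++i) s0 += in[i];  // scalar remainder
  return (s0 + s1) + (s2 + s3);    // combine partial sums at the end
}
```

The same shape applies to the AVX version: keep 8 __m256 accumulators live across the inner loop, then merge them once before the horizontal sum.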
const vec_t &curr = curr_delta[sample];

for (size_t i = 0; i < params.in.depth_; i++) {
  const double pi = curr[i] * pool_area_inv;
You should rewrite this line with AVX intrinsic functions.
  __m256d sum_m = _mm256_setzero_pd();
  const size_t depth_index = i * pool_area;
  for (size_t j = 0; j < nblocks_per_channel; j++) {
    __m256d in_m = _mm256_load_pd(&in[depth_index + 4 * j]);
Are you sure the address &in[depth_index + 4 * j] is always aligned to 32 bytes?
If not, you should use _mm256_loadu_pd.
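One way to sanity-check that assumption during development (an illustrative helper, not project code) is an explicit alignment predicate:

```cpp
#include <cassert>
#include <cstdint>

// _mm256_load_pd requires its address to be 32-byte aligned;
// _mm256_loadu_pd has no such requirement. This checks the
// low 5 address bits, which must all be zero for 32-byte alignment.
inline bool is_aligned32(const void *p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 31u) == 0;
}
```

Note that even if the buffer's base is 32-byte aligned, an offset like depth_index = i * pool_area only preserves that alignment when the offset itself is a multiple of 4 doubles, so the unaligned load is the safe default here.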
Force-pushed from afba6a5 to 962b229.
@beru I have addressed the review comments. Please have another look.
@karandesai-96 could you rename 494e1bb to say something like "add google benchmark framework"?
#if defined(_MSC_VER)
#define CNN_MUST_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUST_INLINE inline
#endif
Force-pushed from b8440f2 to befed45.
@karandesai-96 You need to replace
@beru Be right back, making a coffee with stronger caffeine content xD
Force-pushed from 3d61f96 to 8466482.
LGTM, a couple of small fixes
CMakeLists.txt (outdated)
option(BUILD_TESTS "Set to ON to build tests" OFF)
option(BUILD_EXAMPLES "Set to ON to build examples" OFF)
option(BUILD_DOCS "Set to ON to build documentation" OFF)
option(BUILD_BENCHMARKS "Set to ON to build documentation" OFF)
s/documentation/benchmark/
@@ -0,0 +1,50 @@
#pragma once
The license comment is missing.
@@ -3,6 +3,7 @@

file(GLOB_RECURSE ALL_CXX_SOURCE_FILES
  ${CMAKE_SOURCE_DIR}/tiny_dnn/*.h
  ${CMAKE_SOURCE_DIR}/benchmarks/*.h
*.cpp too.
const core::backend_t engine = context.engine();

if (engine == core::backend_t::avx) {
#ifdef CNN_USE_AVX
This would mean that if engine is set to core::backend_t::avx and CNN_USE_AVX is undefined, nothing happens? This is kind of unintuitive and also inconsistent with other layers, where we fall back to internal in this case.
@Randl I think killing the core::backend_t::avx entry and all the AVX-related code when CNN_USE_AVX isn't defined would be a simpler solution.
I mean disabling the code with preprocessor directives.
Vote for it
👍
@edgarriba You need to remember that the voting system is powerless before you.
tiny_dnn/util/macro.h (outdated)
#define CNN_MUSTINLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
#endif
@karandesai-96 You_should_start_liking_to_insert_an_underbar_between_words.
@beru CNN_SCREAMING_SNAKE_CASE_FOR_THE_WIN
Good to merge?
@edgarriba will fix the typo and license comment once I finish travelling.
Force-pushed from 8466482 to af7cd45.
tiny_dnn/util/macro.h (outdated)
@@ -12,3 +12,11 @@
#if defined _WIN32 && !defined(__MINGW32__)
#define CNN_WINDOWS
#endif

#if defined(_MSC_VER)
#define CNN_MUSTINLINE __forceinline
There's something missing between T and I. (Hint: 0x5F)
How did I miss this 😕 thanks for pointing out
tiny_dnn/util/macro.h (outdated)
#elif defined(__GNUC__) || defined(__clang__) || defined(__ICC)
#define CNN_MUST_INLINE __attribute__((always_inline)) inline
#else
#define CNN_MUSTINLINE inline
Feel free to perform statement coverage test for this path.
Is that what makes AppVeyor fail?
This one, #687 (comment), is the cause.
Force-pushed from af7cd45 to ebdd564.
Since the global average pooling layer calculates the average of all activations per channel, we pick up contiguous blocks of 8 floats and keep performing vertical sums channelwise. At the end, the net sum is accumulated by a horizontal sum. This is repeated for all channels of a layer.
The current code falls back to the internal backend if nnpack or another unsupported backend is chosen.
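The forward pass described above can be sketched as a scalar reference, which the AVX path merely vectorizes 8 floats at a time (illustrative code; function and parameter names are made up, not the PR's actual signatures):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for global average pooling: one output per
// channel, equal to the mean of that channel's `area` activations.
std::vector<float> global_avgpool(const std::vector<float> &in,
                                  std::size_t depth, std::size_t area) {
  std::vector<float> out(depth, 0.0f);
  const float inv = 1.0f / static_cast<float>(area);  // hoisted reciprocal
  for (std::size_t c = 0; c < depth; ++c) {
    float sum = 0.0f;  // the AVX version accumulates 8 floats per step
    for (std::size_t j = 0; j < area; ++j) {
      sum += in[c * area + j];  // "vertical" accumulation
    }
    out[c] = sum * inv;  // reciprocal multiply instead of division
  }
  return out;
}
```

A reference like this is also handy as the oracle when writing tests for the AVX kernels, as requested in the review.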