
Compilation failed with g++ 6.3.0 #4

Closed
LaurentPlagne opened this issue May 18, 2017 · 20 comments

Comments

@LaurentPlagne

Hi,
the HPTT compilation fails on my machine (6700K, Ubuntu 17.04, g++ 6.3.0) with the following message:

```
/usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h:994:1: error: inlining failed in call to always_inline ‘void hptt::_mm256_stream_ps(float*, hptt::__m256)’: target specific option mismatch
 _mm256_stream_ps (float *__P, __m256 __A)
 ^~~~~~~~~~~~~~~~
```

The compilation is OK with Intel icpc 2018.

@solomonik
Contributor

I am getting this error also with g++ 5.4.0

```
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:41:0,
                 from /usr/lib/gcc/x86_64-linux-gnu/5/include/x86intrin.h:46,
                 from /usr/include/x86_64-linux-gnu/c++/5/bits/opt_random.h:33,
                 from /usr/include/c++/5/random:50,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from src/hptt.cpp:24:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:994:1: error: inlining failed in call to always_inline ‘void _mm256_stream_ps(float*, __m256)’: target specific option mismatch
 _mm256_stream_ps (float *__P, __m256 __A)
```

@springer13
Owner

Could it be that you are not compiling on an AVX-enabled CPU?

I just changed the Makefile to always use -mavx irrespective of the underlying architecture. Does this solve your issue?

@solomonik
Contributor

solomonik commented May 19, 2017

@springer13 it builds, but I get the following error when trying to use it within the CTF test_suite (is there a native test I can run?), which at first sight looks possibly related to AVX:

```
vex amd64->IR: unhandled instruction bytes: 0xC5 0xFB 0x11 0x45 0xC0 0x48 0x89 0x4D
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==17057== valgrind: Unrecognised instruction at address 0x5ab804.
==17057==    at 0x5AB804: hptt::create_plan(int const*, int, double, double const*, int const*, int const*, double, double*, int const*, hptt::SelectionMethod, int, int const*) (hptt.cpp:1924)
==17057==    by 0x4DA5A0: CTF_int::nosym_transpose_hptt(int, int const*, int const*, int, char const*, char*, CTF_int::algstrct const*) (nosym_transp.cxx:385)
==17057==    by 0x4DA969: CTF_int::nosym_transpose(CTF_int::tensor*, int, int const*, int const*, int) (nosym_transp.cxx:440)
==17057==    by 0x4AFCFB: CTF_int::contraction::map_fold(bool) (contraction.cxx:738)
==17057==    by 0x4BE187: CTF_int::contraction::contract() (contraction.cxx:4216)
==17057==    by 0x4BFC8B: CTF_int::contraction::sym_contract() (contraction.cxx:4661)
==17057==    by 0x4BF77C: CTF_int::contraction::sym_contract() (contraction.cxx:4612)
==17057==    by 0x4BF77C: CTF_int::contraction::sym_contract() (contraction.cxx:4612)
==17057==    by 0x4BF77C: CTF_int::contraction::sym_contract() (contraction.cxx:4612)
==17057==    by 0x4C0BAC: CTF_int::contraction::home_contract() (contraction.cxx:4834)
==17057==    by 0x4AD79F: CTF_int::contraction::execute() (contraction.cxx:111)
==17057==    by 0x544D9E: CTF_int::Contract_Term::execute(CTF::Idx_Tensor) const (term.cxx:572)
==17057== Your program just tried to execute an instruction that Valgrind
==17057== did not recognise.  There are two possible reasons for this.
==17057== 1. Your program has a bug and erroneously jumped to a non-code
==17057==    location.  If you are running Memcheck and you just saw a
==17057==    warning about a bad jump, it's probably your program's fault.
==17057== 2. The instruction is legitimate but Valgrind doesn't handle it,
==17057==    i.e. it's Valgrind's fault.  If you think this is the case or
==17057==    you are not sure, please let us know and we'll try to fix it.
==17057== Either way, Valgrind will now raise a SIGILL signal which will
==17057== probably kill your program.
```

@springer13
Owner

Have you tried the scalar version? Does that work?

You can build the benchmark (see Readme.md for instructions). Alternatively, you could build the test framework. Please let me know if this gives you any problems.

@LaurentPlagne
Author

Hi,
It builds now with -mavx: thank you very much!

My CPU is a Skylake (6700K).
g++ with -mtune=native and -march=native works fine for my other projects...

@LaurentPlagne
Author

Hi again,
I have a segfault with g++ 6.3 (it works with icpc).
It appears to be caused by only 16-byte alignment of A and B in the microkernel::execute function:
A + (0,1,2,3) * lda
and
B + (0,1,2,3) * ldb
should be 32-byte aligned for _mm256_load_pd and _mm256_store_pd, but they are only 16-byte aligned.
If I replace them with their loadu and storeu counterparts, the segfault goes away (but of course this is not what you want).

@springer13
Owner

Thanks for spotting this. I just pushed a fix: I am now using the unaligned loads and stores. As long as the data is aligned to 32-byte boundaries there is no performance penalty, and these instructions also work in the non-aligned case.

@LaurentPlagne
Author

OK, I am a bit surprised that there is no performance penalty... In my experience unaligned stores and loads are slower...

@springer13
Owner

There is an easy way to test this: run the benchmark once with storeu_ps and once with store_ps and see whether you encounter any performance penalty; usually this is not the case. However, if you actually shift the array, say to a 16-byte boundary, then it might be different. That case would require HPTT to peel off the first 16 bytes to reach a 32-byte boundary; this is not done yet.

@LaurentPlagne
Author

LaurentPlagne commented May 19, 2017

Ok, I will check. Note that my input arrays A and B are properly 32-byte aligned, but their sizes (50,50,50) may cause the problem.

@LaurentPlagne
Author

Wow... it is fast!
(50,50,50) arrays with the {2,1,0} permutation reach 60 GB/s, while a direct copy without transposition is about 66 GB/s!

@springer13
Owner

I am happy to hear that :) From what I have seen, HPTT achieves close to peak performance across a wide range of tensor transpositions and sizes (see paper). Its advantages really play out once the tensors become too large to fit into the caches.

That being said, notice that 50x50x50 is too small to get reliable timings, since it actually fits into the L3 cache (~488 KiB).

@LaurentPlagne
Author

Yep, I should increase the array size.
BTW, I did use the execute_expert method. When I switch back to the execute() method I
have a segfault in the _mm256_stream_pd method:
`at 0x530F050: _mm256_stream_pd (avxintrin.h:990)
==32333== by 0x530F050: streamingStore (hptt.cpp:262)
==32333== by 0x530F050: macro_kernel<16, 16, 1, double, true> (hptt.cpp:587)
==32333== by 0x530F050: void hptt::transpose_int<16, 16, 1, double, true>(double const*, double const*, double*, double const*, double, double, hptt::ComputeNode const*) (hptt.cpp:675)
'

@springer13
Owner

Without the source code it is hard to tell whether this is a bug in HPTT or on your side. However, if you tell HPTT to use streaming stores via the expert interface, then HPTT will use streaming stores no matter what; thus you have to make sure that streaming stores are applicable. This is done to reduce the branch overhead for very small tensor transpositions, where such overhead becomes noticeable.

@LaurentPlagne LaurentPlagne reopened this May 19, 2017
@LaurentPlagne
Author

I have the segfault with the non-expert execute method...

@springer13
Owner

Can you please provide the tensor transposition and size that you used? beta = 0? Which compiler did you use?

@LaurentPlagne
Author

LaurentPlagne commented May 19, 2017

```cpp
const int dim = 3;
const int nx = 50;
const int ny = 50;
const int nz = 50;

int perm[dim];
perm[0] = 2; perm[1] = 1; perm[2] = 0;
int size[dim];
size[0] = nz; size[1] = ny; size[2] = nx;

double alpha = 1.0;
double beta = 0.0;

const int numThreads = 4;

auto plan = hptt::create_plan(perm, dim,
                              alpha, A, size, NULL,
                              beta, B, NULL,
                              hptt::ESTIMATE, numThreads);
plan->execute();
```

gcc version 6.3.0 20170406 (Ubuntu 6.3.0-12ubuntu2)

@LaurentPlagne
Author

Changing line 262 from

```cpp
_mm256_stream_pd(out, _mm256_load_pd(in));
```

to

```cpp
_mm256_stream_pd(out, _mm256_loadu_pd(in));
```

seems to solve the problem ;)

@springer13
Owner

Oh yes, good catch. Thanks. Please pull.

@LaurentPlagne
Author

Done. Thanks !
