
COPY memory alignment SIGSEGV bug #1137

Closed

apotschka opened this issue Mar 28, 2017 · 33 comments · Fixed by #2555

Comments

@apotschka

Minimal example check_copy.c, compiled via

$ cc -g -o check_copy check_copy.c /usr/lib/libopenblas.so.0
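
The attachment itself is not reproduced in this thread; the following is a minimal sketch of what check_copy.c presumably does, reconstructed from the description here (all names and sizes are illustrative):

```c
/* Hypothetical reconstruction of check_copy.c -- the point is that x and y
 * are deliberately shifted by single bytes, so that for offset == 1 they are
 * not even 8-byte aligned. */
#include <stdio.h>
#include <stdlib.h>

#define NM 300

/* Fortran-style BLAS symbol exported by libopenblas. */
extern void dcopy_(const int *n, const double *x, const int *incx,
                   double *y, const int *incy);

int main(void) {
    char *xbuf = malloc(NM * sizeof(double) + 64);
    char *ybuf = malloc(NM * sizeof(double) + 64);
    int n = NM, inc = 1;

    for (size_t offset = 0; offset < 16; offset += 1) { /* offset += 8 runs fine */
        double *x = (double *)(xbuf + offset);          /* misaligned for odd offsets */
        double *y = (double *)(ybuf + offset);
        for (int i = 0; i < n; i++)
            x[i] = 1.0 / (i + 1);
        dcopy_(&n, x, &inc, y, &inc);                   /* SIGSEGV for offset == 1 */
        for (int i = 0; i < 3; i++)
            printf("%f %f\n", x[i], y[i]);
    }
    free(xbuf);
    free(ybuf);
    return 0;
}
```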

Error occurs for offset == 1:

$ valgrind ./check_copy
==16357== Memcheck, a memory error detector
==16357== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==16357== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==16357== Command: ./check_copy
==16357==
1.000000 1.000000
0.500000 0.500000
0.333333 0.333333
==16357==
==16357== Process terminating with default action of signal 11 (SIGSEGV)
==16357==  General Protection Fault
==16357==    at 0x64FB980: dcopy_k_HASWELL (in /usr/lib/libopenblasp-r0.2.18.so)
==16357==    by 0x6EEE82F: (below main) (libc-start.c:291)

Changing the increment of offset to offset += 8 makes the code run fine.

System:

$ uname -a
Linux *** 4.4.0-67-generic #88-Ubuntu SMP Wed Mar 8 16:34:45 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ cc --version
cc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
$ zcat /usr/share/doc/libopenblas-base/changelog.Debian.gz | head -1
openblas (0.2.18-1ubuntu1) xenial; urgency=medium
$ cat /proc/cpuinfo | head -30
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 1201.921
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs            :
bogomips        : 6599.88
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6

I know it is not a good idea to have vectors that are not aligned at 8-byte address boundaries. However, my expectation would be that the code does not crash in this case.

@martin-frbg
Collaborator

Segfault happens on Nehalem as well; copy_sse2.S appears to be original code from libGoto. I do not have time today to investigate further - maybe there is something unusual about the test case if it only comes up now, or maybe the assembly "only" needs the proper .align in the proper place?

@martin-frbg
Collaborator

For the record, gdb puts the fault at line 216 of copy_sse2.S, which is movaps -16*SIZE(X),%xmm0. (movaps requires its memory operand to be 16-byte aligned.)

@apotschka
Author

I guess the problem is a violation of the Intel ABI, pp. 13-14:

Like the Intel386 architecture, the AMD64 architecture in general does not require all data accesses to be properly aligned. Misaligned data accesses are slower than aligned accesses but otherwise behave identically. The only exceptions are that __m128, __m256 and __m512 must always be aligned properly.
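
In intrinsics terms, the quoted rule is the difference between _mm_load_pd (which compiles to movaps and requires 16-byte alignment) and _mm_loadu_pd (movups). A minimal illustration, not OpenBLAS code (compile without optimization, since the compiler may otherwise fold or relax the load):

```c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    double buf[4] __attribute__((aligned(16))) = {1, 2, 3, 4};
    __m128d ok = _mm_loadu_pd(buf + 1);  /* movups: unaligned access, works */
    printf("%f\n", _mm_cvtsd_f64(ok));
    __m128d bad = _mm_load_pd(buf + 1);  /* movaps from an 8-byte-offset address: #GP, reported as SIGSEGV */
    printf("%f\n", _mm_cvtsd_f64(bad));
    return 0;
}
```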

@brada4: Is turning off SSE even for properly aligned memory really a good idea in view of performance?

I would be happy with the following resolution:

  • Keep the OpenBLAS code as is. The fault really indicates a memory misalignment problem in the calling code.
  • Document the crash messages caused by misaligned memory so that users can find them via Google and fix their code. Maybe this issue is a good place, in addition to an entry in the FAQ.

@brada4
Contributor

brada4 commented Mar 29, 2017

SSE2 is omnipresent on x86_64 and is emitted by any compiler, and _copy (or any Level 1 BLAS) routines are completely memory-bound. Even glibc has AVX memory-copy routines nowadays; plain SSE assembly is probably even slower than memcpy in the borderline case (memory-bound, i.e. it spends 2x more CPU cycles in the same time).
I think there is no big damage in bypassing something that tries to corrupt irrelevant memory.

@brada4
Contributor

brada4 commented Mar 29, 2017

@martin-frbg What makes me wonder is that Sandy Bridge is not affected.

@martin-frbg
Collaborator

Wouldn't the test case above (with offset=8, of course) allow benchmarking the SSE version against a compiler-generated implementation, rather than handwaving?

@brada4
Contributor

brada4 commented Mar 31, 2017

Most kernels assume aligned input anyway, but that does not mean they need to crash in unaligned cases.
That is, they start with 2^n-sized data blocks and then process the tail of the array in smaller blocks. To handle unaligned input optimally, a second "tail" must process the head of the input until an aligned block start is reached.

@martin-frbg
Collaborator

Maybe it is just the cblas interface that should ensure proper alignment before calling the kernel function (but that would seem to create the weird situation of using a potentially inefficient copy implementation just to drive the hand-optimized one)? At least playing with the .align statements already present in copy_sse2.S does not appear to help, while replacing the two affected movaps instructions at label L15 with movups does. Still no benchmarking to see which version wins.

@martin-frbg
Collaborator

My preliminary tests suggest that replacing the movaps instructions of copy_sse2.S with movups has almost negligible impact on Nehalem at least, while the plain C implementation from arm/copy.c is significantly slower. (320 vs 500 seconds for the aligned version of check_copy.c with nm=300 and an outer loop that does 10 million invocations of the loop on "offset").

@martin-frbg
Collaborator

From various sources, it appears that movups is identical to movaps in performance when accessing aligned memory on Intel targets since Nehalem. For AMD processors - which I do not have available - I only find this feature expressly mentioned in their Software Optimization Guide for the "Family 16h" processors, but not for 15h (Bulldozer et al.).
As (I think) was already mentioned above, "fixing" just the xCOPY functions would probably only delay the crash with unaligned data until another optimized function is reached, and general performance will probably be poorer than with properly aligned data. However, I am not aware of a more benign way to signal to users that they had better align their data...

@apotschka
Author

Why not check memory alignment on each call and fail with a meaningful error message on misalignment instead of a SIGSEGV? Checking alignment costs just an AND of the address with a mask of 0x7 (8-byte alignment), 0xF (16 bytes), 0x1F (32 bytes), etc., and a jump to the error handler if the result is nonzero. Branch prediction (no jump taken) should work efficiently in this case.
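
A sketch of what such a check could look like (hypothetical names; blas_error is not an existing OpenBLAS function):

```c
#include <stdint.h>

/* A pointer is N-byte aligned (N a power of two) iff its low bits are zero. */
static inline int is_aligned(const void *p, uintptr_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* At the top of a kernel that uses 16-byte movaps accesses:
 *   if (!is_aligned(x, 16) || !is_aligned(y, 16))
 *       return blas_error("dcopy: input not 16-byte aligned");
 */
```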

@martin-frbg
Collaborator

Have to defer to @xianyi and other core developers on this, but to me this sounds expensive even if it is "just an AND operation" on every call.
Perhaps providing a set of utility functions to make it easier for users to check their input, and/or documenting the requirement of proper alignment, would already help?
copy_sse2.S appears to be unchanged from libGoto2, so the issue does not appear to arise often (though of course nobody knows how many people may have hit that segmentation fault and switched to a different BLAS implementation immediately).
As an immediate fix for just this specific issue, copy_sse2.S could be cloned to a new file that uses movups throughout and is referenced from just KERNEL.NEHALEM, KERNEL.SANDYBRIDGE and KERNEL.HASWELL, so as not to compromise the performance on older or AMD cpus. Or would the performance difference be tiny even there - I see movups being used already (even via #define movapd movups) in trsm_kernel_LN_4x4_barcelona.S and others - so that it could just be replaced unconditionally?

@mattip
Contributor

mattip commented Dec 7, 2018

this sounds expensive even if it is "just an AND operation" on every call

The other option is SIGSEGV. It seems worth the cost. There are other checks on the input; why is this one singled out for exclusion?

@brada4
Contributor

brada4 commented Dec 7, 2018

All kernels consider their inputs to be somewhat aligned, probably to more than the size of the variable itself.
Typically arguments are aligned to a page by way of malloc(); see #1907 for my (partial, incomplete) attempt to maintain this later for threading.
GEMM can survive because it copies+scales its arguments into aligned buffers; copy itself cannot. Strictly speaking it is not a SIGSEGV (like an access outside the allocated range); it is essentially a general protection fault because the instruction is not happy with the alignment of an otherwise legitimate access.
All solutions are fairly complex, say
1/ add an entrance sled in addition to the exit sled and call the kernels only on aligned blocks - this involves tricky, non-portable pointer arithmetic, though luckily read-only with respect to the pointers themselves (see the sketch below)
2/ go with C-only, less optimal kernels - minimal code change
3/ various combinations thereof - lots of coding for not much gain
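
For concreteness, a minimal sketch of option 1/ for unit-stride dcopy (illustrative only; dcopy_kernel_aligned is a hypothetical name for the optimized kernel). Note that if x and y have different alignment residues, at most one of them can be brought to a 16-byte boundary, which is part of what makes this tricky:

```c
#include <stdint.h>

/* hypothetical optimized kernel that requires 16-byte-aligned x */
extern void dcopy_kernel_aligned(long n, const double *x, double *y);

void dcopy_with_head_peel(long n, const double *x, double *y) {
    long i = 0;
    /* entrance sled: copy scalars until x reaches a 16-byte boundary
     * (if x is not even 8-byte aligned, this loop handles everything) */
    while (i < n && ((uintptr_t)(x + i) & 0xF) != 0) {
        y[i] = x[i];
        i++;
    }
    /* hand the aligned remainder to the vectorized kernel */
    if (i < n)
        dcopy_kernel_aligned(n - i, x + i, y + i);
}
```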

@martin-frbg
Collaborator

@mattip back then I had hoped for additional comments, but the fact that the topic went dormant seems to suggest that most callers ensure that their data is aligned, even if "only" for performance. On the other hand, my misgivings about potential performance impacts from using movups were related to hardware that was already old back then, and would be even more outdated two years on.
Running alignment checks on all input still strikes me as impractical.

@brada4
Contributor

brada4 commented Dec 7, 2018

Though it is only a forum comment, Intel claims that unaligned instructions are performance-neutral:
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587

@martin-frbg
Collaborator

#1137 (comment) - the open question is (or was) whether this holds on older (and specifically older AMD) CPUs as well.

@brada4
Contributor

brada4 commented Dec 7, 2018

That is, whether the unaligned (U) instructions on old processors are detrimental in the aligned (A) cases?

@martin-frbg
Collaborator

Exactly. No need to worry about the future performance of the U cases when they just blow up right now.
(But what I apparently did not realize back then is that a lot more assembly kernels use movaps, sometimes even intermixed with movups. So changing copy_sse2 alone would probably just shift the crash to the next function. Or perhaps I did realize it, and decided to let this rest...)

@brada4
Contributor

brada4 commented Dec 7, 2018

Why do people stick to those old things? The affected processors (CORE2, OPTERON, OPTERON_SSE3 and the oldest half of ATOM) are roughly on par with a Raspberry Pi 3 for numerical computations... (with the usual remark about the electricity bill)

@martin-frbg
Collaborator

For the Core2 generation, no idea (perhaps some poor undergrads); Atom is still a valid current-generation target and could be used for pattern detection in camera setups or the like. At least I see no need to "kill" the old targets, or to degrade their performance even further, just to cater to some unusual and on the whole undesirable special case. (Are there any examples where unaligned data can/will show up and not be easily caught by the program before it passes them to BLAS, or is this more of an academic exercise?)

@toulorge-cenaero

Hello, I hit this bug today on a recent Whiskey Lake processor.

To elaborate on the comments above, I would like to give my point of view as an OpenBLAS user. It seems to me that calling BLAS routines with unaligned data is not a mere theoretical exercise: it can happen in practice if you allocate a large matrix/vector in the first place and then perform BLAS operations on sub-blocks. Moreover, I think that a significant part of BLAS library users are unaware of memory alignment issues; they may move away from OpenBLAS if faced with crashes while they use it in compliance with the standard interface. Finally, even those who are aware of the problem will be reluctant to modify their code to take into account specific requirements of OpenBLAS, when they can use alternative BLAS libraries. Thus I think that the memory alignment issues should be handled on the BLAS library side rather than in the application code.

Programming in OpenBLAS is beyond the scope of my duties and skills, but I am happy to contribute to solving this bug by testing. Here is my system information:

[ttoulorg@nbk141 ~]$ cc --version
cc (GCC) 8.3.1 20190507 (Red Hat 8.3.1-4)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[ttoulorg@nbk141 ~]$ cat /proc/cpuinfo | head -26
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 142
model name	: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
stepping	: 12
microcode	: 0xca
cpu MHz		: 4282.672
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit
bogomips	: 4224.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

Thank you.

@brada4
Contributor

brada4 commented Feb 23, 2020

@toulorge-cenaero got any backtrace?

In principle, even in your theoretical exercise the values will be as aligned as they were at the start, unless you resort to numerology by shifting a float by single bytes.

Not sure about the BLAS library users you refer to - those using MKL are advised to use MKL's internal allocator if they are unable to get aligned allocations from the operating system.

@toulorge-cenaero

@brada4 Here is the OpenBLAS part of the backtrace when run with my application:

#0  dcopy_k () at ../kernel/x86_64/copy_sse2.S:107
#1  0x00007ffff571d66d in dcopy_ (N=0x7fffffffb79c, x=0x7fffec5d3084, INCX=0x7fffffffb7a0, y=0x5555565134a0, INCY=0x7fffffffb7a4)
    at copy.c:78

When I run the minimal example provided in the first comment, I get the fault at line 216 of copy_sse2.S as mentioned previously.

In principle, even in your theoretical exercise the values will be as aligned as they were at the start, unless you resort to numerology by shifting a float by single bytes.

I am not sure we understand each other. Let's imagine that I allocate memory for a 4x5 float matrix at a pointer p that is 16-byte aligned. Working with this matrix should be fine. However, if I want to work on the block made out of the last four rows, that starts at index (1, 0), I will pass to SCOPY or other BLAS routines a pointer that is p + 5 floats, i.e. p + 20 bytes, which is not aligned. It seems to me that this is well illustrated by the check_copy.c minimal example provided above. Am I missing something?
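
In code, the scenario just described looks roughly like this (a hedged illustration; scopy_ is the standard BLAS symbol):

```c
#include <stdlib.h>

extern void scopy_(const int *n, const float *x, const int *incx,
                   float *y, const int *incy);

int main(void) {
    /* 16-byte aligned allocation for a 4x5 row-major float matrix */
    float *p = aligned_alloc(16, 4 * 5 * sizeof(float));
    float out[15];
    int n = 15, inc = 1;

    for (int i = 0; i < 20; i++)
        p[i] = (float)i;

    /* Sub-block starting at index (1, 0): p + 5 floats = p + 20 bytes,
     * 4-byte aligned but NOT 16-byte aligned. */
    scopy_(&n, p + 5, &inc, out, &inc);

    free(p);
    return 0;
}
```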

Not sure about the BLAS library users you refer to - those using MKL are advised to use MKL's internal allocator if they are unable to get aligned allocations from the operating system.

As explained above, I do not think that my problem is related to the allocation itself. I was not referring to any specific BLAS library: my point is rather that BLAS being a standard interface, it seems to me that most users expect BLAS libraries to be interchangeable so that their code is easily ported to different systems. Please do not get me wrong: I am happy and very grateful that I can use OpenBLAS on my system. Nevertheless, I am reluctant to handle requirements of specific BLAS implementations in my application beyond those of the standard interface.

@martin-frbg
Collaborator

It is trivial to replace the movaps instructions in copy_sse2.S with movups (or just copy the attached file over kernel/x86_64/copy_sse2.S and recompile OpenBLAS), but I doubt this will be the end of the story - most likely you will hit another alignment problem elsewhere. Can you try?
copy_sse2.txt

@brada4
Contributor

brada4 commented Feb 23, 2020

MOVUPS takes more cycles than MOVAPS on old processors even in the aligned case (5 vs 2), but it is still sustainable as long as CPUs are clocked well above RAM.

@brada4
Contributor

brada4 commented Feb 23, 2020

It is common sense to align a 4-byte float to 4 bytes; yours is not, and glibc memcpy() would crash the same way.

@brada4
Contributor

brada4 commented Feb 23, 2020

@martin-frbg does it look viable to check alignment in the SSE copy kernels and fall back to C code in the unlikely() case?
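
A minimal sketch of that idea (hypothetical function names; __builtin_expect is what an unlikely() macro would expand to):

```c
#include <stdint.h>

/* existing asm kernel, shown here with a hypothetical C prototype */
extern void dcopy_kernel_sse2(long n, const double *x, double *y);

static void dcopy_c_fallback(long n, const double *x, double *y) {
    for (long i = 0; i < n; i++)
        y[i] = x[i];
}

void dcopy_checked(long n, const double *x, double *y) {
    /* fall back to plain C only in the (unlikely) misaligned case */
    if (__builtin_expect((((uintptr_t)x | (uintptr_t)y) & 0xF) != 0, 0))
        dcopy_c_fallback(n, x, y);
    else
        dcopy_kernel_sse2(n, x, y);
}
```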

@toulorge-cenaero

@martin-frbg Your solution allows me to correctly run both the check_copy.c example and my application. I am not so surprised that my application runs smoothly after fixing DCOPY: if there was a similar problem in the BLAS Level 2 and Level 3 routines that we mostly use, I guess somebody would have found it much earlier. Thank you for your help. I hope that this fix or a similar one will be included in the next release of OpenBLAS.

@brada4 The data is obviously 4-byte aligned here... The problem is the 16-byte alignment (i.e. the width of 4 floats) required by SSE instructions, whereas the sub-vector or sub-matrix that we want to copy does not necessarily start at an index that is a multiple of 4 within the larger allocated vector or matrix.

@MigMuc

MigMuc commented Feb 23, 2020

The iamax_sse2 routines have both code paths implemented. The check for aligned or unaligned access is done e.g. on lines 94-100.

@martin-frbg
Collaborator

@MigMuc thanks. Interestingly, copy_sse2.S does already have some concept (and handling) of (un)ALIGNED_ACCESS, though this attribute does not appear to be defined for HASWELL and similar "recent" kernels in either l1param.h or l2param.h. (This leads to the check being performed on Y rather than X, if I read it correctly - but, again if I read it correctly, it would appear that those CPUs where unaligned accesses incur a performance penalty already define ALIGNED_ACCESS and hence hit the other code path.)

@MigMuc

MigMuc commented Feb 23, 2020

Currently I am trying to write an AVX2 version of the copy routines for Haswell and Zen CPUs, using the dscal.c implementation as a starting point. When compiling I get some issues with the intrinsics, which I use in the same way as dscal_microk_skylakex-2.c does. I cannot find where in the Makefiles the -mavx2 switch is set for ZEN:
../kernel/x86_64/dcopy_microk_haswell.c:13:2: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi] 13 | _mm256_storeu_pd(&y[i + 0], _mm256_loadu_pd(&x[i + 0]));
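
For reference, a minimal sketch of such an AVX2 copy microkernel with unaligned loads/stores (illustrative; it must be compiled with -mavx2, which is exactly what the warning above is about):

```c
#include <immintrin.h>

static void dcopy_avx2(long n, const double *x, double *y) {
    long i = 0;
    for (; i + 4 <= n; i += 4)  /* four doubles per 256-bit register */
        _mm256_storeu_pd(&y[i], _mm256_loadu_pd(&x[i]));
    for (; i < n; i++)          /* scalar tail */
        y[i] = x[i];
}
```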

@martin-frbg
Collaborator

martin-frbg commented Feb 23, 2020

Makefile.x86_64 should be the place for this - but as (I think) none of the relevant kernels uses intrinsics on ZEN, it has not come up as a problem that -mavx2 is so far only added for TARGET=HASWELL there.
