OpenBLAS on dSPACE realtime hardware #2102

Open
mohseninima opened this issue Apr 29, 2019 · 48 comments

@mohseninima

mohseninima commented Apr 29, 2019

Hello,

I am currently trying to update some code to use OpenBLAS and run it on a dSPACE 1103 PowerPC board, but I am having some issues. The build steps are a little confusing, so I will try my best to explain. There are 3 devices in this setup:

  1. A laptop where the OpenBLAS is compiled
    -Haswell i7-4700HQ
    -Ubuntu 18.04 WSL 64 bit

  2. The host for the dSPACE system
    -Sandy Bridge-E Xeon E5-1620
    -Windows 7 64 bit
    -MATLAB 32 bit

  3. Real-time board
    -dSPACE 1103
    -PowerPC 750GX
    -Receives final compiled code from the host

I first compile OpenBLAS on the laptop using
make DYNAMIC_ARCH=1 BINARY=32 HOSTCC=gcc CC=i686-w64-mingw32-gcc FC=i686-w64-mingw32-gfortran CFLAGS='-static-libgcc -static-libstdc++ -static -ggdb' FFLAGS='-static' && mv -f libopenblas.dll.a libopenblas.lib

I then copy over the lib/dll.a/include files to the host PC. In my existing code MYCODE.c I include cblas.h and update my functions to use cblas. I then use MATLAB to compile a mex file by using the following command
mex -v MYCODE.c libopenblas.lib -g -I'openblas/include' -lmwlapack

This compiles successfully and I am able to run my mex file and obtain correct results.

Now to upload the code to the real-time board I first create a model in Simulink that uses the mex file and call that using rtwbuild('MYMODEL') which will take my C files and compile them using the PPCTools37 compiler for the real-time board PPC architecture. I then receive the error

COMPILING "..\MYCODE.c"
(F) C0005; "C:\Users\XXXX\Desktop\openblasproject\openblas\include\common.h", line 87 pos 20; could not open source file "unistd.h"

Any idea why I would be receiving this error?

Thank You

@martin-frbg
Collaborator

This file is included from common.h when _MSC_VER is not defined (that is, when the compiler is not recognized as MSVC, some kind of Unix-like environment like mingw/msys is assumed). You can try to #define _MSC_VER in your code before it includes any of the OpenBLAS headers, or find out how the PPCTools37 compiler introduces itself in the default preprocessor defines that it adds.

@brada4
Contributor

brada4 commented Apr 30, 2019

Matlab includes MKL BLAS. You do not need OpenBLAS at that point.
Why would yourcode.c include an internal OpenBLAS header at all?

@mohseninima
Author

mohseninima commented Apr 30, 2019

Answering @martin-frbg

This file is included from common.h when _MSC_VER is not defined (that is, when the compiler is not recognized as MSVC, some kind of Unix-like environment like mingw/msys is assumed). You can try to #define _MSC_VER in your code before it includes any of the OpenBLAS headers, or find out how the PPCTools37 compiler introduces itself in the default preprocessor defines that it adds.

When using #define _MSC_VER I now get the following when compiling the code for PPC.

(F) C0005; "C:\Users\XXXX\Desktop\openblasproject\openblas\include\common.h", line 117 pos 21; could not open source file "windows.h"

I did some more searching and the full compiler name is "Microtec PowerPC C/C++ Compiler 3.7"
The following compiler macros are defined

_MRI Defined as 1.
_MICROTEC Defined as 1.
_TARGET_PPC Defined as 1 if the compiler is being used for a PPC target; otherwise undefined.
_VERSION The version of the compiler in literal string format.
_UNIX Defined as 1 if the compilation host is any UNIX variant.
_SOLARIS Defined as 1 if the compilation host is Solaris.
_LINUX Defined as 1 if the compilation host is Linux.
_WINDOWS Defined as 1 if the compilation host is any version of Windows.
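
The predefined macros listed above are what a header can test to recognise this compiler. A minimal sketch, assuming only the macro names quoted from the Microtec documentation (_MICROTEC, _TARGET_PPC) plus the usual _MSC_VER, __clang__, __GNUC__, __powerpc__ and __PPC__; the sketch_* function names are made up for illustration and are not part of OpenBLAS:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: name the compiler from its predefined macros. */
const char *sketch_compiler_name(void)
{
#if defined(_MICROTEC)
    return "Microtec";
#elif defined(_MSC_VER)
    return "MSVC";
#elif defined(__clang__)
    return "Clang";
#elif defined(__GNUC__)
    return "GCC";
#else
    return "unknown";
#endif
}

/* Hypothetical helper: 1 if the compiler reports a PowerPC target, else 0. */
int sketch_targets_ppc(void)
{
#if defined(_TARGET_PPC) || defined(__powerpc__) || defined(__PPC__)
    return 1;
#else
    return 0;
#endif
}

/* Print a one-line summary of what the preprocessor saw. */
void sketch_print_report(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "compiler: %s, PowerPC target: %s\n",
            sketch_compiler_name(),
            sketch_targets_ppc() ? "yes" : "no");
}
```

Compiling this once with each toolchain in the setup and calling sketch_print_report(stdout) would show which branch a header guard based on these macros takes.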

For @brada4's question

Matlab includes MKL BLAS. You do not need OpenBLAS at that point.
Why would yourcode.c include an internal OpenBLAS header at all?

From what I know MKL is really only optimized for x86 systems and I did not get much if any performance benefit between reference BLAS and MKL on this PPC system. I am trying to push the limits of this system and I am hoping OpenBLAS will help me do so.

@brada4
Contributor

brada4 commented Apr 30, 2019

You build (cross-build linux to win32) on your laptop for use with 5 years old last 32bit matlab release - what does it have to do with some BSP compiler failing in a completely independent build?
Can you get errors and show command lines from that BSP compiler? That build has absolutely nothing to do with matlab or mingw gcc you used before.

@mohseninima
Author

mohseninima commented Apr 30, 2019

It does have something to do with the previous build. The mex file is used to construct a simulink model that the PPC compiler uses in combination with the original source code to create the final compiled software that is sent to the real-time system. This is standard for the system, the dSPACE software uses MATLAB to interface with the dSPACE hardware. The only thing I am doing differently than normal is trying to incorporate OpenBLAS. The 5 year old 32 bit matlab release is required (I would love to upgrade if I could).

I have included the output and the makefile that is called when running rtwbuild('MYMODEL')
output.txt
MYMODEL.mk.txt

@brada4
Contributor

brada4 commented Apr 30, 2019

I see no openblas built in your log nor the error you encounter....

@mohseninima
Author

mohseninima commented Apr 30, 2019

I see no openblas built in your log nor the error you encounter....

Line 73 and 77 in output.txt

@martin-frbg
Collaborator

Does it compile when you simply change the #if !defined(_MSC_VER) to #if 0 in line 86 of common.h (and remove the #define _MSC_VER again, as it seems to cause other problems) ? If that does the trick, we could probably change that line to something like #if !defined(_MSC_VER) && !defined(_TARGET_PPC)

@brada4
Contributor

brada4 commented Apr 30, 2019

USER_LIBS = "C:\Users\XXXX\Desktop\openblasproject\libopenblas.lib" "C:\Program Files (x86)\MATLAB\R2013a\extern\lib\win32\microsoft\libmwlapack
I don't think a PPC BLAS will ever link with a win32 LAPACK.
C0005 code is from visual studio compiler, nothing to do with BSP compiler.

@mohseninima
Author

mohseninima commented May 1, 2019

@martin-frbg Setting #if 0 ends up causing the same error as defining _MSC_VER

(F) C0005; "C:\Users\XXXX\Desktop\openblasproject\openblas\include\common.h", line 117 pos 21; could not open source file "windows.h"
#include <windows.h>

@brada4 This is an error message from the PPC compiler

C0005 could not open source file “filename”

Also linking with the MATLAB libmwlapack and libmwblas works fine on their own (this is the built in MATLAB MKL). I am going step by step when implementing OpenBLAS by first replacing the BLAS before trying LAPACK.
The error messages I am receiving are before anything from libmwlapack is even loaded.

@brada4
Contributor

brada4 commented May 1, 2019

Could you try cross-building PPC component outside matlab? One single line error for unknown command is not really helpful.
Are you certain microtec compiler emulates MSVC error messages? Is it documented anywhere in the world?

@martin-frbg
Collaborator

I see now that the #include <windows.h> in line 117 is independent of _MSC_VER; it depends on #ifdef OS_WINDOWS. Probably best to add an #ifndef _TARGET_PPC ... #endif around that line (or simply comment it out for now) and see where that gets you. (The entire #ifdef OS_WINDOWS cannot be negated, as its #else branch would again try to include unistd.h.)
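
The two guards discussed so far can be sketched together. This is only an illustration of the idea, assuming the macro names mentioned in this thread (_MSC_VER, _TARGET_PPC, OS_WINDOWS); the SKETCH_* markers and sketch_header_mask are hypothetical and exist only to make the chosen branch observable, and the real common.h contains much more than this:

```c
#include <assert.h>

/* Guard 1: skip unistd.h for MSVC and for the PPC cross compiler. */
#if !defined(_MSC_VER) && !defined(_TARGET_PPC)
#include <unistd.h>
#define SKETCH_HAVE_UNISTD 1
#else
#define SKETCH_HAVE_UNISTD 0
#endif

/* Guard 2: keep the OS_WINDOWS branch, but skip windows.h for the PPC target. */
#ifdef OS_WINDOWS
#ifndef _TARGET_PPC
#include <windows.h>
#define SKETCH_HAVE_WINDOWS_H 1
#endif
#endif
#ifndef SKETCH_HAVE_WINDOWS_H
#define SKETCH_HAVE_WINDOWS_H 0
#endif

/* Hypothetical helper: bit 0 = unistd.h included, bit 1 = windows.h included. */
int sketch_header_mask(void)
{
    int mask = SKETCH_HAVE_UNISTD | (SKETCH_HAVE_WINDOWS_H << 1);
    assert(mask >= 0 && mask <= 3);
    return mask;
}
```

With _TARGET_PPC defined, neither system header is pulled in, which is the outcome the suggestion above is aiming for.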

@mohseninima
Author

mohseninima commented May 1, 2019

@martin-frbg
Doing that in combination with changing #if !defined(_MSC_VER) to #if 0, I now receive the following
error-5-1.txt

@brada4
I can try but it might take time, it is not documented (and I am not sure if possible) to compile outside of matlab.
I am certain about the error
errorcodes

@brada4
Contributor

brada4 commented May 1, 2019

It should be possible to compile using plain make, e.g.
make CC=ppc-cross.exe HOSTCC=cl.exe TARGET=PPC440 FC=NONE

@martin-frbg
Collaborator

The "new" errors from common.h appear to be about the use of the phrase __inline void somefunction rather than void __inline somefunction (and the same with "int"). What is worrying is that it still appears to compile some x86 code (otherwise it should not need to include common_x86.h, and the errors from param.h appear to be in settings for Intel CPUs as well). Wouldn't it need to create PPC code (and access an OpenBLAS compiled for the PPC440 target) to run on the dSPACE board rather than on the x86 CPU of its Windows host?

@brada4
Contributor

brada4 commented May 2, 2019

The errors say you are building x86 assembly code for supposedly powerpc with a powerpc compiler.
Maybe you start taking things seriously and at least get a compiler matching your target?

Sorry if this sounds overreacting.

@martin-frbg
Collaborator

@brada4 mind your words please...

@martin-frbg
Collaborator

Maybe I am confused about what you are trying to compile here with the PPC compiler: is it OpenBLAS itself, or is it your own code? (I expect you will need to compile both, as you will probably need a PPC version of the library if you want the BLAS calls to be executed on the PPC440 board. The mingw compile on your laptop created the Windows-style library for the Xeon CPU only.) In your own code, you would probably include openblas_config.h and cblas.h, but not OpenBLAS' internal headers like common.h.

@mohseninima
Author

I have tried making some diagrams to show what is going on
So this is what works normally (for the past 6+ years)
works

This is what I am currently trying to do
current

and if I am understanding correctly @martin-frbg , you are proposing this? I would update the library for the new target?
new

@martin-frbg
Collaborator

I am not sure if I understand the mex workflow - where you currently have "Reference BLAS", is this the source code or a binary ? (I expect this would have to be either sourcecode or a precompiled ppc binary if you want the BLAS calls to be executed on the dSPACE board)

@mohseninima
Author

I am not sure if I understand the mex workflow - where you currently have "Reference BLAS", is this the source code or a binary ? (I expect this would have to be either sourcecode or a precompiled ppc binary if you want the BLAS calls to be executed on the dSPACE board)

It is the source

@martin-frbg
Collaborator

OK, then you would probably start from the OpenBLAS source as well, and build it for TARGET=PPC440 rather than relying on the autodetection that would only see the Xeon and Windows. But whatever you do you should not include common.h in your own MYCODE.c

@brada4
Contributor

brada4 commented May 2, 2019

In principle, at the point where OpenBLAS enters the picture it could be a blob produced outside mex, in the same static library format as the "other code". That is almost the same as the win32 DLL you produced earlier.
So mymodel.c would be the same as mycode.c (with added MATLAB function wrappers), then compiled into a native PPC executable.

@mohseninima
Author

Ok, so I am now trying to make a static library for the powerpc portion of the compilation. Would this be the correct command?

make TARGET=PPC440 HOSTCC=gcc CC=powerpc-linux-gnu-gcc FC=powerpc-linux-gnu-gfortran CFLAGS='-static-libgcc -static-libstdc++ -static -ggdb' FFLAGS='-static'

I receive the following output. Compilation fails at line 1784 complaining about junk at the end of the line
failedcompile.txt

@brada4
Contributor

brada4 commented May 3, 2019

CFLAGS should be replaced with CCOMMON_OPT, and FFLAGS likewise with FCOMMON_OPT; the "standard" names are changed internally, e.g. a different set is used for building LAPACK etc.

I think you need to use your dspace compiler (gcc will work if target is really kind of Linux)

OpenBLAS does not use libstdc++ , and it actually builds static library, see Makefile.rule for all possible options (technically "ar" archive on Linux, but check with "file" command so it is same as dspace static components)

@martin-frbg
Collaborator

The "junk at end of line" messages are from the assembler apparently not understanding register names like r14 (or more likely, misinterpreting the DCBT macro that is defined (as an L1_PREFETCH) in common_power.h). Not sure if this is due to the "wrong" compiler (or rather, assembler) or some other problem.

@brada4
Contributor

brada4 commented May 3, 2019

Probably it hits C comments that normally are stripped out by C compiler, but maybe not so everywhere in long dormant codes...

like:
xor r1,r1 // we did something

@brada4
Contributor

brada4 commented May 3, 2019

@mohseninima could you try building with make -i and all your parameters (after make clean)?
It will run the build through to the end, showing all errors. No working library will be produced; it just will not stop at the first error encountered.

@martin-frbg
Collaborator

@brada4 no, it is stumbling over instructions like DCBT(A01,PREA) (where "A01" is #defined to be r14 , "PREA" is r24) that expand to dcbt r24, r14 according to common_power.h. No comments anywhere near these. As far as I can tell, PPC440 support in OpenBLAS is inherited from the original GotoBLAS.

@mepholic

I'm hitting this issue trying to build with target PPC970. After looking at this carefully, and running the affected files through the preprocessor (ex: cpp -I src/OpenBLAS-0.3.7 src/OpenBLAS-0.3.7/kernel/power/gemm_kernel_altivec.S), you can see that the DCBT macro is interpreted as something like: dcbt 8, 29, 7

The IBM Documentation claims:

The dcbt instruction serves as both a basic and extended mnemonic. The dcbt mnemonic with three operands is the basic form, and the dcbt with two operands is the extended form. In the extended form, the TH field is omitted and assumed to be 0b0000.

The example assembly code on that page goes on to show them using the extended mnemonic, but does not make use of the basic mnemonic. My assembler (GNU assembler version 2.32 (powerpc64-foxkit-linux-musl) using BFD version (GNU Binutils) 2.32) does not seem to like the "basic mnemonic" syntax with 3 arguments to the dcbt instruction. When I attempt to build OpenBLAS, I get failures in the same files as @mohseninima did above. For instance:

../kernel/power/gemm_kernel_altivec.S: Assembler messages:
../kernel/power/gemm_kernel_altivec.S:348: Error: junk at end of line: `7'
../kernel/power/gemm_kernel_altivec.S:355: Error: junk at end of line: `8'
../kernel/power/gemm_kernel_altivec.S:409: Error: junk at end of line: `7'
make[1]: *** [Makefile.L3:671: strmm_kernel_LN.o] Error 1

I'm not sure which assembler this code was written for or tested on, but it appears that GNU assembler is not one of them.

Additionally, I question the macro definition of DCBT() in this code, which is the chosen form for my hardware based on the defines in the immediately preceding code (I'm building on Linux, not Darwin or FreeBSD). Regardless, it appears that the DCBT() macro reverses the order of the arguments passed to dcbt, which seems to counter the IBM documentation linked above. I'm not sure if it's intentional or not, but it certainly doesn't look right.

I have hardware that I can readily test fixes on. Any help that can be provided in fixing this issue would be greatly appreciated! For reference, my host machine is an IBM Power9, which should be backwards compatible with the PPC970.
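
The operand reversal and the two- versus three-operand forms can be made visible with the C preprocessor alone. A minimal sketch, assuming stand-in macros modelled on what this thread describes (A01 as r14, PREA as r24, a leading argument of 8, operands emitted in swapped order); SKETCH_DCBT2, SKETCH_DCBT3 and the helper functions are hypothetical and are not copied from common_power.h:

```c
#include <assert.h>
#include <string.h>

/* Two-step stringification so that the argument is macro-expanded first. */
#define SKETCH_STR_(...) #__VA_ARGS__
#define SKETCH_STR(...)  SKETCH_STR_(__VA_ARGS__)

/* Hypothetical stand-ins for the macro variants discussed above. */
#define SKETCH_DCBT_ARG 8
#define SKETCH_DCBT3(REGA, REGB) dcbt SKETCH_DCBT_ARG, REGB, REGA
#define SKETCH_DCBT2(REGA, REGB) dcbt REGB, REGA

/* Register aliases as quoted in the thread. */
#define A01  r14
#define PREA r24

/* Text the assembler would see for the three-operand variant. */
const char *sketch_dcbt3_text(void)
{
    return SKETCH_STR(SKETCH_DCBT3(A01, PREA));
}

/* Text the assembler would see for the two-operand variant. */
const char *sketch_dcbt2_text(void)
{
    return SKETCH_STR(SKETCH_DCBT2(A01, PREA));
}

/* Count comma-separated operands in an instruction string. */
int sketch_operand_count(const char *insn)
{
    assert(insn != NULL);
    int count = 1;
    for (const char *p = insn; *p != '\0'; p++)
        if (*p == ',')
            count++;
    return count;
}
```

Printing both strings shows that the second macro argument ends up before the first, and that only the three-operand variant carries the extra leading value an older assembler baseline would reject as junk.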

@mepholic

mepholic commented Oct 13, 2019

I actually have an update! I found this bug in sourceware's tracker. They mention that the 3 argument form of the dcbt instruction is only supported on POWER4 and newer, so you need to pass the -mpower4 flag during assembly.

I've applied some changes to my source tree which seems to cause the package to build successfully at least, but now tests are failing:

./openblas_utest
TEST 1/23 amax:samax [OK]
TEST 2/23 drotmg:drotmg_D1_big_D2_big_flag_zero [OK]
TEST 3/23 drotmg:rotmg_D1eqD2_X1eqX2 [OK]
TEST 4/23 drotmg:rotmg_issue1452 [OK]
TEST 5/23 drotmg:rotmg [OK]
TEST 6/23 axpy:caxpy_inc_0 [OK]
TEST 7/23 axpy:saxpy_inc_0 [OK]
TEST 8/23 axpy:zaxpy_inc_0 [OK]
TEST 9/23 axpy:daxpy_inc_0 [OK]
TEST 10/23 zdotu:zdotu_offset_1 [OK]
TEST 11/23 zdotu:zdotu_n_1 [OK]
TEST 12/23 dsdot:dsdot_n_1 [OK]
TEST 13/23 swap:cswap_inc_0 [OK]
TEST 14/23 swap:sswap_inc_0 [OK]
TEST 15/23 swap:zswap_inc_0 [OK]
TEST 16/23 swap:dswap_inc_0 [OK]
TEST 17/23 rot:csrot_inc_0 [FAIL]
  ERR: test_rot.c:109  expected -2.148e-01, got 3.125e-01 (diff -5.273e-01, tol 1.000e-04)
TEST 18/23 rot:srot_inc_0 [OK]
TEST 19/23 rot:zdrot_inc_0 [FAIL]
  ERR: test_rot.c:71  expected -2.148e-01, got 3.125e-01 (diff -5.273e-01, tol 1.000e-13)
TEST 20/23 rot:drot_inc_0 [OK]
TEST 21/23 potrf:smoketest_trivial [OK]
TEST 22/23 potrf:bug_695 [OK]
TEST 23/23 kernel_regress:skx_avx [OK]
RESULTS: 23 tests (21 ok, 2 failed, 0 skipped) ran in 24 ms

Here is a complete build log: buildlog.txt

I haven't traced csrot or zdrot yet, but I'm still wondering if the argument order of dcbt is an issue; I'll be testing that soon, and attempting to root cause the failure of these tests, but I think I'm nearing the point of needing an expert.

The patch I provided in the link above is fairly naive, and I'd imagine that there's other older PowerPC chips that might require the processor revision to be specifically set.

I'm also still curious about the set of defines that determines the format for the DCBT instruction. It seems to indicate that the 2 argument format should be used on PPC970 ONLY on FreeBSD or Darwin. Why is this the case?

@mepholic

I've determined that the test failure is likely related to issue #1469. I believe that the implementation that is failing on PPC970 is contained in zrot.S (as stated by kernel/Makefile.L1) as the kernel/power/KERNEL.PPC970 file defines no CROTKERNEL or ZROTKERNEL.

I've patched KERNEL.PPC970 with some modifications that are similar to this commit following the rationale from the previously mentioned issue, and it seems to build and pass tests.

It would be fantastic to have the Assembly intrinsics for the older PowerPC chips; considering that the hardware is less powerful than modern chips, getting more performance out of the software would be a nice benefit.

I should probably add that our Power9 build host runs in Big Endian mode. Please let me know how you're going to proceed with fixing the issue with the PPC970 target. As I said before, I'm happy to help out in any way that I can! :)
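
For anyone comparing kernel output by hand, the plane rotation that the rot tests exercise can be written in a few lines of plain C. This is a sketch of the textbook BLAS rot operation for real vectors, not the OpenBLAS kernel; sketch_ref_drot is a hypothetical name, and only non-negative increments are handled here. The csrot and zdrot variants apply the same real c and s to the real and imaginary parts of complex vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Reference plane rotation: for each element pair,
 *   x_new = c * x + s * y
 *   y_new = c * y - s * x
 * An increment of 0 makes every step act on the same element,
 * which is the case the *_inc_0 tests cover. */
void sketch_ref_drot(size_t n, double *x, size_t incx,
                     double *y, size_t incy, double c, double s)
{
    assert(x != NULL && y != NULL);
    size_t ix = 0, iy = 0;
    for (size_t i = 0; i < n; i++) {
        double xi = x[ix];
        double yi = y[iy];
        x[ix] = c * xi + s * yi;
        y[iy] = c * yi - s * xi;
        ix += incx;
        iy += incy;
    }
}
```

Running the same inputs through this routine and through the library gives a quick check of whether a mismatch comes from the optimized kernel or from the test setup.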

@martin-frbg
Collaborator

Seems you are already on the right track with your changes. There is no telling if/how/where the DCBT macro ever worked on PPC970 - git blame confirms that the original choice was inherited from GotoBLAS2 (for which next to no version history was available) and the list of exceptions only got added to whenever somebody hit this problem with a particular operating system.
(As an aside, the current POWER8/POWER9 kernels are Little Endian-only)

@mepholic

mepholic commented Oct 13, 2019

@martin-frbg Interestingly enough, the POWER8 kernel seems to build and pass all tests on Big Endian. The utest log is here.

However, the POWER9 kernel does not pass the tests. If you'd like, I could open another issue for the POWER9 test failures on Big Endian. Despite IBM's messaging, there are distributions out there that use and actively develop Big Endian POWER9.

Strangely enough, I seem to run into an intermittent test failure on some (TARGET=PPC970) builds due to a SIGILL.
Here's a link to the relevant part of the test log. I'll keep an eye on it and try and run a debugger the next time I see the test fail.

@brada4
Contributor

brada4 commented Oct 13, 2019

If you could re-run failing sample in gdb and disassemble the failing instruction....
Power omits some PowerPC (aka G5) instructions, that are emulated in AIX, but not Linux.

@mepholic

For completeness, here is the full patch that I'm currently using to build and test the PPC970 target.

I seem to have gotten another build where sblat3 failed with SIGILL on the initial test run. Infuriatingly enough, it doesn't seem to be doing it again when the same test is rerun. I've run it over and over again in GDB hoping to catch the issue, but the process is continuously exiting successfully. GDB's environment has OPENBLAS_NUM_THREADS=2 set, and I'm getting output like this:

(gdb) !rm SBLAT3.SUMM
(gdb) run
Starting program: /home/djt/Adelie/Packages/user/openblas/src/OpenBLAS-0.3.7/test/sblat3 < ./sblat3.dat
[New LWP 9814]
[LWP 9814 exited]
[Inferior 1 (process 9813) exited normally]

Here's the contents of SBLAT3.SUMM. It seems strange to me that it is reporting a FATAL ERROR and not exiting with an error code. Is this an expected failure?

I'm not familiar with Fortran enough to understand where or why it could be failing. I assume that Fortran is calling back to PowerPC ASM/C implementations of these methods. I'm also not very experienced at using GDB beyond running programs and setting breakpoints in binaries that have debug symbols. Any help in further debugging this issue would be much appreciated!

@brada4
Contributor

brada4 commented Oct 14, 2019

This is an assembly instruction that is not necessarily completely illegal (as in unsupported by the CPU); it may be getting unaligned arguments to an otherwise legitimate instruction. It does not matter which compiler generated it; it is wrongly permitted by the -m(arch) flags. The test probably needs to run a few more times to crash again if it did so once.

@martin-frbg
Collaborator

I think for POWER9 you would need to try the current develop branch, with 0.3.7 you will be missing at least the fix for TRMM. And maybe I was overly pessimistic about big-endian support - there is the open issue #1997 but the primary obstacle there seemed to be limitations of the AIX assembler.

@mepholic

I seem to have caught the issue in gdb! Logs are here

I'm pretty sure that's what you're looking for, but I've got gdb still open just in case. I realized that I don't actually have debug symbols enabled... I'm going to rebuild another copy with them enabled and see if I can get it to croak in the meantime.

Also of note, considering we're using musl libc here:

<@awilfox> the SIGILL is a red herring
<@awilfox> that's musl's way of forcing a crash quickly when __stack_chk_fail is hit
<@awilfox> __stack_chk_fail purposefully has an invalid opcode to force the program to crash *now* to prevent a buffer attack
<@awilfox> what is actually happening is that the stack canary is being overwritten

@martin-frbg
Collaborator

ssyr2k_kernel_U is common code (built from driver/level3/syr2k_kernel.c) so if something is trashing the stack it is probably the sgemm kernel.
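
The stack canary idea mentioned above can be illustrated without relying on any real stack layout. A minimal sketch, assuming a hand-rolled guard value placed directly after a buffer inside one struct; all sketch_* names and the guard constant are hypothetical. With stack protection enabled, the compiler inserts and checks an equivalent value automatically and calls __stack_chk_fail on a mismatch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CANARY 0xC0DEFACEu

/* A buffer followed by a guard value, standing in for a stack frame. */
struct sketch_frame {
    unsigned char buf[16];
    uint32_t canary;
};

/* Zero the buffer and arm the guard. */
void sketch_frame_init(struct sketch_frame *f)
{
    assert(f != NULL);
    memset(f->buf, 0, sizeof f->buf);
    f->canary = SKETCH_CANARY;
}

/* Write len bytes from the start of the frame without checking against
 * the size of buf. The write is limited to the struct itself so that the
 * sketch stays within defined behaviour. */
void sketch_unchecked_write(struct sketch_frame *f, unsigned char value, size_t len)
{
    assert(f != NULL && len <= sizeof *f);
    memset(f, value, len);
}

/* 1 if the guard still holds its armed value, 0 if it was overwritten. */
int sketch_canary_intact(const struct sketch_frame *f)
{
    assert(f != NULL);
    return f->canary == SKETCH_CANARY;
}
```

A write of 16 bytes leaves the guard intact, while anything longer changes it. In the real mechanism the corruption is only noticed at function return, which is why the crash location says little about which code did the overwriting.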

@brada4
Contributor

brada4 commented Oct 14, 2019

Try to build with MAX_STACK_ALLOC=0
It could happen that stack allocations fill some values while something else later expects zeroes, and gets confused.

@mepholic

So I've tried a few different things:

Try to build with MAX_STACK_ALLOC=0
It could happen that stack allocations fill some values while something else later expects zeroes, and gets confused.

This didn't have any effect. It still seems to crash in the same place.

musl libc's default stack size is much smaller (128k by default) than glibc. I tried building with -Wl,--stack,4194304 in the linker flags and it seemed to fail at the same place. Regardless, it appears that the OpenBLAS server thread calls pthread_attr_setstacksize with a smaller value. I'm not exactly sure if this alters my linker flag stack size setting in the child threads (where the code actually seems to be failing). This also seems to still fail at around the same place.

Lastly, I disabled all compiler optimizations by removing all optimization flags and adding -O0. Now I'm experiencing a "halting problem". Now it appears that both the sblat3 and the cblat3 tests never exit, and I can't tell if they ever will. I let cblat3 run overnight in gdb, and it never halted or returned.

The un-optimized code appears to be spending most of its time around here in the callstack:

^C
Thread 1 "sblat3" received signal SIGINT, Interrupt.
0x0000000100034518 in ssyr2k_kernel_U (m=31, n=31, k=1, alpha_r=1, a=0x3ffff5d17a80, b=0x3ffff5d70680, c=0x3ffffffee92c, ldc=32, offset=0, flag=1) at syr2k_kernel.c:179
179                 subbuffer[i + j * nn] + subbuffer[j + i * nn];
(gdb) bt
#0  0x0000000100034518 in ssyr2k_kernel_U (m=31, n=31, k=1, alpha_r=1, a=0x3ffff5d17a80, b=0x3ffff5d70680, c=0x3ffffffee92c, ldc=32, offset=0, flag=1) at syr2k_kernel.c:179
#1  0x00000001000303e4 in ssyr2k_UN (args=0x3ffffffd8e00, range_m=0x0, range_n=0x3ffffffd6130, sa=0x3ffff5d17a80, sb=0x3ffff5d70680, dummy=0) at level3_syr2k.c:205
#2  0x000000010004bf24 in exec_blas (num=1, queue=0x3ffffffd6338) at blas_server.c:826
#3  0x0000000100049734 in syrk_thread (mode=256, arg=0x3ffffffd8e00, range_m=0x0, range_n=0x0, function=0x100030038 <ssyr2k_UN>, sa=0x3ffff5d17a80, sb=0x3ffff5d70680, nthreads=2) at syrk_thread.c:187
#4  0x0000000100019adc in ssyr2k_ (UPLO=0x3ffffffd9518 "U", TRANS=0x3ffffffd9508 "N擄T\327\f\277N", N=0x3ffffffd91ac, K=0x3ffffffd9188, alpha=0x3ffffffd9170, a=0x3ffffffd9f18, ldA=0x3ffffffd9198, b=0x3ffffffe2320, ldB=0x3ffffffd919c, beta=0x3ffffffd9174, c=0x3ffffffee92c, ldC=0x3ffffffd91a0) at syr2k.c:382
#5  0x00000001000037b8 in schk5_ ()
#6  0x0000000100017344 in MAIN__ ()
#7  0x00000001000178e4 in main ()

I found that I can get a full annotated dump of the assembly for these binaries with a command like objdump -lSd cblat3, here's the output from that if it helps.

I'm not exactly sure what my next steps should be.
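
For orientation, the expression at line 179 in the backtrace above adds a scratch block and its transpose. The following is a guess at the shape of such an update, based only on that one expression; sketch_add_sym_upper is a hypothetical name, the loop bounds and the target index are assumptions, and the actual code in syr2k_kernel.c may differ:

```c
#include <assert.h>
#include <stddef.h>

/* Add (sub + transpose(sub)) into the upper triangle of c.
 * sub is an nn x nn column-major scratch block; c is column-major
 * with leading dimension ldc. */
void sketch_add_sym_upper(size_t nn, const float *sub, float *c, size_t ldc)
{
    assert(sub != NULL && c != NULL && ldc >= nn);
    for (size_t j = 0; j < nn; j++) {
        for (size_t i = 0; i <= j; i++) {
            c[i + j * ldc] += sub[i + j * nn] + sub[j + i * nn];
        }
    }
}
```

A loop of this kind reads and writes only within the nn x nn block, so a hang here would more likely point at a wrong nn or ldc reaching it than at the arithmetic itself.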

@brada4
Contributor

brada4 commented Oct 16, 2019

In the first backtrace, a fancy ret with stack check starts after 0x0000000100028e94. What is not clear to me: were the calls leading there also checking the stack, or does that heavy return happen only when you "exit the library"?

Call profile is same as x86_64.

@martin-frbg
Collaborator

Building with TARGET=PPC970 on POWER9 (ppc64le, gcc-9.2.1) does not give me any crashes,
but both SGEMM&friends and the corresponding single-precision complex functions all fail their tests.
(No complaints about the DCBT macro there, but I had to apply the crot/zrot fix)

@martin-frbg
Collaborator

Changing the SGEMM and CGEMM kernels in KERNEL.PPC970 to the non-altivec versions as used for DGEMM and ZGEMM (with the corresponding blanking of the INCOPY/ITCOPY and -OBJ entries, and changing of the GEMM_UNROLL_M to 4 and 2 respectively) fixed "my" problem, and will probably work for you as well.

@awilfox

awilfox commented Oct 16, 2019

(No complaints about the DCBT macro there, but I had to apply the crot/zrot fix)

As noted earlier and in the binutils bug, that macro issue won't happen on LE because binutils as(1) defaults to POWER8 ISA on LE (which is probably wrong and may stop being the default moving forward, since the OpenPOWER FPGAs don't implement ISA 2.07 but are LE). On BE, it defaults to something much lower (possibly POWER2?) which doesn't support the three-operand dcbt and needs -mpower4 (or later) passed. Just FYI, that is why that is happening.

@martin-frbg
Collaborator

@awilfox thanks for the reminder, I had indeed lost track of the context of the dcbt problem. Unfortunately even the old GotoBLAS snapshots do not provide much insight into this choice of macro - GotoBLAS-1.00 from 2006 already had PPC970 support, and DCBT at that time was defined (unconditionally) as
#define DCBT(REGA, REGB, NUM) .long (0x7c00022c | (REGA << 16) | (REGB << 11) | ((NUM) << 21)) (which is still preserved as a comment in the current common_power.h). This was only used once, #ifdef linux, in the zgemv_t kernel (with the default action a two-operand dcbt).
The DCBT_ARG was introduced with (or before) 1.07, with the "0" value reserved for POWER3 and POWER5 (which were both already supported targets in 1.00). 1.07 had something like a changelog, and a comment there states Some early POWER5 won't accept DCBT extended format. (And the FAQ at that time had an entry for POWER5 owners that they "may modify it" back to the generic DCBT_ARG 8). By then DCBT was already used in axpy,gemv,gemm,trmm and trsm with no distinction on operating system.
Perhaps this was just an ugly workaround for a poorly understood problem all along, and the two-operand form should have been used universally from the start ?
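
The historical macro quoted above assembles the instruction word by hand. The same bit arithmetic as a small C sketch, using exactly the constant and shifts from that macro; sketch_encode_dcbt is a hypothetical name, and the range check assumes each operand occupies a 5-bit field:

```c
#include <assert.h>
#include <stdint.h>

/* Base word taken from the quoted GotoBLAS macro. */
#define SKETCH_DCBT_BASE 0x7c00022cu

/* Build the 32-bit instruction word the same way the old macro did:
 * REGA goes to bit 16, REGB to bit 11, NUM to bit 21. */
uint32_t sketch_encode_dcbt(unsigned rega, unsigned regb, unsigned num)
{
    assert(rega < 32 && regb < 32 && num < 32);
    return SKETCH_DCBT_BASE
         | ((uint32_t)rega << 16)
         | ((uint32_t)regb << 11)
         | ((uint32_t)num << 21);
}
```

For example, sketch_encode_dcbt(24, 14, 0) gives 0x7c18722c and sketch_encode_dcbt(0, 0, 8) gives 0x7d00022c. Emitting the word through .long bypasses the assembler's mnemonic checks, which may be why the original form did not depend on the assembler's default ISA level.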

@martin-frbg
Collaborator

This is what I have come up with so far (excluding the DCBT bug - do not see it on big-endian opensuse so probably recent binutils have changed their baseline to power4. Still probably makes sense to drop the "Darwin or BSD" requirement on the conditional for two-operand dcbt).

ppc970be.diff.txt
