Version 0.3.21 Segmentation Fault on Power9 #3738

Closed
ocelotStride opened this issue Aug 23, 2022 · 15 comments

Comments

@ocelotStride

The following build succeeds on version 0.3.20 but fails on 0.3.21. I have tried enabling and disabling various options, without success.

[Image: Screenshot 2022-08-23 at 16 52 01]

@RajalakshmiSR

@ocelotStride Can you paste the command that gives the segmentation fault during make? Also, please share the full make command used.

@ocelotStride
Author

Makefile.rule has INTERFACE64 enabled, BINARY set to 64, and ReLAPACK enabled.

The make commands that fail are:
make
make USE_MASS=1 TARGET=POWER9
make USE_MASS=1

@brada4
Contributor

brada4 commented Aug 23, 2022

Can you record the full build output? Do you run make clean between builds?

@hhorak

hhorak commented Aug 24, 2022

We see a build failure on Power in Fedora as well. The furthest I got was this failing test:

$ OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 gdb ./test_sbgemm
(gdb) run
Starting program: /home/builder/rpmbuild/BUILD/openblas-0.3.21/serial64/test/test_sbgemm 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
sgemm_beta_POWER9 () at ../kernel/power/gemm_beta.S:115
115		STFD	f0,   0 * SIZE(CO1)
(gdb) bt
#0  sgemm_beta_POWER9 () at ../kernel/power/gemm_beta.S:115
(gdb) l
110		mtspr	CTR, r0
111		ble	LL(15)
112		.align 4
113	
114	LL(12):
115		STFD	f0,   0 * SIZE(CO1)
116		STFD	f0,   1 * SIZE(CO1)
117		STFD	f0,   2 * SIZE(CO1)
118		STFD	f0,   3 * SIZE(CO1)
119		STFD	f0,   4 * SIZE(CO1)

https://bugzilla.redhat.com/show_bug.cgi?id=2120974#c2

@martin-frbg
Collaborator

Can you (both) please try with PR #3718 applied? Without it, the SBGEMM test up to and including 0.3.21 does not handle INTERFACE64 at all, so you would see half of the "long int" argument trash the stack, although the actual BLAS kernels it is supposed to test are correct. (What 0.3.21 did to aggravate the problem was to enable the bfloat16 code by default.)
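
To illustrate the width mismatch described above, here is a minimal Python sketch (not OpenBLAS code). It assumes that integer arguments are passed by reference, as in the Fortran-style BLAS interface, and that the machine is little-endian like ppc64le. The helper name `read_as_int64` and the sample values are hypothetical. The caller stores a 32-bit integer, while a library built with INTERFACE64 loads 64 bits from the same address and so picks up whatever bytes follow.

```python
import struct


def read_as_int64(value: int, neighbour: int) -> int:
    """Reinterpret a 32-bit int plus the 4 bytes after it as one 64-bit int.

    Hypothetical helper: `neighbour` stands in for whatever unrelated data
    happens to sit next to the caller's 32-bit argument in memory.
    """
    # Caller side: a 32-bit argument followed by 4 bytes of unrelated data
    # (little-endian layout, an assumption matching ppc64le).
    memory = struct.pack("<ii", value, neighbour)
    # Callee side: a 64-bit-integer interface loads 8 bytes from that address.
    (seen,) = struct.unpack("<q", memory)
    return seen


m = 4  # the matrix dimension the caller intended to pass
print(read_as_int64(m, 0))  # 4: neighbouring bytes happen to be zero
print(read_as_int64(m, 7))  # 30064771076: neighbouring bytes leak into the value
```

When the neighbouring bytes happen to be zero the call appears to work, which is one way such a mismatch can go unnoticed. Otherwise the routine sees a very large dimension and may write far outside the intended buffer.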

@RajalakshmiSR

@martin-frbg Is there a way to backport PR #3718 to 0.3.21?

@aekoroglu

aekoroglu commented Aug 24, 2022

I confirmed that #3718 fixes the segmentation fault on fedora-ppc64le:

rm -f ?BLAT2.SUMM
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat2 < ./sblat2.dat
rm -f ?BLAT3.SUMM
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./test_sbgemm > SBBLAT3.SUMM
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./sblat3 < ./sblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat2 < ./dblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat3 < ./dblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat3 < ./cblat3.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat2 < ./zblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat3 < ./zblat3.dat
make[1]: Leaving directory '/builddir/build/BUILD/openblas-0.3.21/serial64/test'
/usr/bin/make -j 28 -C utest all
make[1]: Entering directory '/builddir/build/BUILD/openblas-0.3.21/serial64/utest'

However, two tests in test_ismin.c (ismax:negative_step_2 and ismin:negative_step_2) are failing now:

./openblas_utest
TEST 1/38 max:smax_zero [OK]
TEST 2/38 max:dmax_positive [OK]
TEST 3/38 max:smax_negative [OK]
TEST 4/38 min:smin_zero [OK]
TEST 5/38 min:dmin_positive [OK]
TEST 6/38 min:smin_negative [OK]
TEST 7/38 amax:damax [OK]
TEST 8/38 amax:samax [OK]
TEST 9/38 ismax:negative_step_2 [FAIL]
ERR: test_ismin.c:89 expected 9, got 50
TEST 10/38 ismax:positive_step_2 [OK]
TEST 11/38 ismin:negative_step_2 [FAIL]
ERR: test_ismin.c:63 expected 9, got 50
TEST 12/38 ismin:positive_step_2 [OK]
TEST 13/38 drotmg:drotmg_D1_big_D2_big_flag_zero [OK]
TEST 14/38 drotmg:rotmg_D1eqD2_X1eqX2 [OK]
TEST 15/38 drotmg:rotmg_issue1452 [OK]
TEST 16/38 drotmg:rotmg [OK]
TEST 17/38 axpy:caxpy_inc_0 [OK]
TEST 18/38 axpy:saxpy_inc_0 [OK]
TEST 19/38 axpy:zaxpy_inc_0 [OK]
TEST 20/38 axpy:daxpy_inc_0 [OK]
TEST 21/38 zdotu:zdotu_offset_1 [OK]
TEST 22/38 zdotu:zdotu_n_1 [OK]
TEST 23/38 dsdot:dsdot_n_1 [OK]
TEST 24/38 swap:cswap_inc_0 [OK]
TEST 25/38 swap:sswap_inc_0 [OK]
TEST 26/38 swap:zswap_inc_0 [OK]
TEST 27/38 swap:dswap_inc_0 [OK]
TEST 28/38 rot:csrot_inc_0 [OK]
TEST 29/38 rot:srot_inc_0 [OK]
TEST 30/38 rot:zdrot_inc_0 [OK]
TEST 31/38 rot:drot_inc_0 [OK]
TEST 32/38 dnrm2:dnrm2_tiny [OK]
TEST 33/38 dnrm2:dnrm2_inf [OK]
TEST 34/38 potrf:smoketest_trivial [OK]
TEST 35/38 potrf:bug_695 [OK]
TEST 36/38 kernel_regress:skx_avx [OK]
TEST 37/38 fork:safety [OK]
TEST 38/38 fork:safety_after_fork_in_parent [OK]
RESULTS: 38 tests (36 ok, 2 failed, 0 skipped) ran in 60038 ms

@RajalakshmiSR

@aekoroglu What is the compiler version used?

@aekoroglu

sh-5.1# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/ppc64le-redhat-linux/12/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-libstdcxx-backtrace --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-12.2.1-20220819/obj-ppc64le-redhat-linux/isl-install --enable-offload-targets=nvptx-none --without-cuda-driver --enable-offload-defaulted --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-long-double-format=ieee --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 12.2.1 20220819 (Red Hat 12.2.1-1) (GCC)

@martin-frbg
Collaborator

@RajalakshmiSR Distributors could apply the patch (which one gets by appending ".diff" to the GitHub URL of #3718) for the time being. I do not think re-releasing 0.3.21 with that fix included would make sense on my side. If the NumPy crowd keeps coming up with Apple M1 bugs, there may be a 0.3.22 relatively soon.

@aekoroglu

We successfully compiled 0.3.21 for ppc64 with #3718 applied and released the build: https://koji.fedoraproject.org/koji/buildinfo?buildID=2052854 Thank you, @martin-frbg and @hhorak.

@martin-frbg
Collaborator

So the strange ISMIN fault you saw is gone again as well?

@aekoroglu

Yes, we're good now :)

@martin-frbg
Collaborator

Ok, thanks for confirming :)
