Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sage miscompiles with gcc9 #27676

Closed
Konrad127123 opened this issue Apr 16, 2019 · 47 comments
Closed

sage miscompiles with gcc9 #27676

Konrad127123 opened this issue Apr 16, 2019 · 47 comments

Comments

@Konrad127123
Copy link

gcc-9 is due out in late April/early May 2019. With the current gcc-9 snapshot, sage has issues where it miscompiles.

There seem to be two sets of issues:

  1. openblas-0.3.5 doesn't compile correctly with gcc-9. This makes lots of errors occur if SAGE_CHECK="yes" (in openblas and other packages that depend on it) and makes make testlong output lots of errors and eventually just hang. As of writing (2019/04/16), this seems to be fixed in upstream git.

  2. Something seems to go wrong with brial or something related. make testlong has errors on the following files:

sage -t --long --warn-long 31.0 src/sage/crypto/mq/sr.py  # Killed due to abort
sage -t --long --warn-long 31.0 src/sage/rings/polynomial/multi_polynomial_sequence.py  # Killed due to abort
sage -t --long --warn-long 31.0 src/sage/rings/polynomial/pbori.pyx  # Killed due to abort

As a minor issue, with SAGE_CHECK="yes", make doesn't seem to stop after compiling openblas, despite many tests failing.

Latest gcc-9 snapshot as of writing: ftp://gcc.gnu.org/pub/gcc/snapshots/9-20190414/gcc-9-20190414.tar.xz

Current openblas upstream as of writing: https://community.dur.ac.uk/konrad.dabrowski/sage/openblas-0.3.5b.tar.gz

CC: @kiwifb

Component: packages: standard

Keywords: gcc, openblas, brial

Branch/Commit: u/Konrad127123/gcc-9-test2 @ 6a54ee1

Reviewer: Steven Trogdon

Issue created by migration from https://trac.sagemath.org/ticket/27676

@Konrad127123
Copy link
Author

Attachment: openblas-0.3.5.p1.log.gz

Test suite output of openblas with gcc-9

@Konrad127123
Copy link
Author

Attachment: testlong.log.gz

make testlong output with gcc-9 and openblas from upstream git.

@Konrad127123
Copy link
Author

Branch: u/Konrad127123/gcc-9-test2

@kiwifb
Copy link
Member

kiwifb commented Apr 16, 2019

New commits:

884e9daisl patch is in gcc-9
a11398eDisable isl when building gcc to avoid inconsistently building gcc with sage's optional isl package. See also #26735.
2d6dc98Update gcc to the 9-20190414 snapshot
6a54ee1Update OpenBLAS to 2019/04/14

@kiwifb
Copy link
Member

kiwifb commented Apr 16, 2019

Commit: 6a54ee1

@Konrad127123

This comment has been minimized.

@Konrad127123

This comment has been minimized.

@Konrad127123
Copy link
Author

comment:5

FYI: The branch currently attached to this ticket is to help with testing/investigating. It's not an attempt to fix the bugs.

@Konrad127123
Copy link
Author

Attachment: sr.txt.gz

Output of sage -t --long --warn-long 31.0 src/sage/crypto/mq/sr.py run standalone

@Konrad127123
Copy link
Author

Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/multi_polynomial_sequence.py run standalone

@Konrad127123
Copy link
Author

Attachment: multi_polynomial_sequence.txt.gz

Output of sage -t --long --warn-long 31.0 src/sage/rings/polynomial/pbori.pyx run standalone

@dimpase
Copy link
Member

dimpase commented May 16, 2019

comment:6

Attachment: pbori.txt.gz

With ModuleNotFoundError: No module named 'Cython', as seen in the logs, all bets are off.

I tried gcc 9.1 (from Gentoo) and got a lot of "Bad exit" in tests from memory overflows.

@kiwifb
Copy link
Member

kiwifb commented May 16, 2019

comment:7

Have you run any tests on individual packages? I would expect a few things to come from individual packages.

@vbraun
Copy link
Member

vbraun commented May 18, 2019

comment:8

Maybe we can upgrade to the released openblas 0.3.6 on a separate ticket to get that out of the way

@vbraun
Copy link
Member

vbraun commented May 18, 2019

comment:9

This is now 27847

@vbraun
Copy link
Member

vbraun commented May 18, 2019

comment:10

With #27847 only the following failures remain

sage -t --long src/sage/crypto/mq/sr.py  # Killed due to abort
sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py  # Killed due to abort
sage -t --long src/sage/rings/polynomial/pbori.pyx  # Killed due to abort

@kiwifb
Copy link
Member

kiwifb commented May 18, 2019

comment:11

That's less than I have in sage-on-gentoo but I haven't rebuilt everything yet. But there is really a track leading to pbori not sure if it is just the sage interface or brial itself. brial passes its own testsuite with gcc-9.1 that's already something but not necessarilly conclusive that it is all on sage-side.

From what Steve Trogdon posted to me

Cython backtrace
----------------
#0  0x00007fe418a5fd80 in __waitpid () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/nptl/../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x00007fe4104374eb in print_enhanced_backtrace () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:563
#2  0x00007fe410437687 in sigdie () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:589
#3  0x00007fe410436a0e in sigdie_for_sig () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:160
#4  0x00007fe410436ba6 in cysigs_signal_handler () at /storage/strogdon/gentoo-rap/var/tmp/portage/dev-python/cysignals-1.10.2/work/cysignals-1.10.2/build/src/cysignals/implementation.c:262
#5  0x00007fe4190cea30 in __restore_rt ()
#6  0x00007fe4186b6330 in __GI_raise () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/signal/../sysdeps/unix/sysv/linux/raise.c:50
#7  0x00007fe41869f7a4 in __GI_abort () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-libs/glibc-2.29-r2/work/glibc-2.29/stdlib/abort.c:79
#8  0x00007fe401744d20 in __gnu_cxx::__verbose_terminate_handler () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/vterminate.cc:95
#9  0x00007fe401742dc0 in __cxxabiv1::__terminate () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#10 0x00007fe401742e00 in std::terminate () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#11 0x00007fe401743000 in __cxxabiv1::__cxa_throw () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#12 0x00007fe401715ac6 in std::__throw_length_error () at /storage/strogdon/gentoo-rap/var/tmp/portage/sys-devel/gcc-9.1.0/work/gcc-9.1.0/libstdc++-v3/src/c++11/functexcept.cc:78
#13 0x00007fe3aa2e6ea2 in std::deque<polybori::CCuddNavigator, std::allocator<polybori::CCuddNavigator> >::_M_new_elements_at_back () at /storage/strogdon/gentoo-rap/usr/lib64/gcc/x86_64-pc-linux-gnu/9.1.0/include/g++-v9/bits/stl_deque.h:2191
  2186           *  Makes sure the _M_map has space for new nodes.  Does not
  2187           *  actually add the nodes.  Can invalidate _M_map pointers.
  2188           *  (And consequently, %deque iterators.)
  2189           */
  2190          void
> 2191          _M_reserve_map_at_back(size_type __nodes_to_add = 1)
  2192          {
  2193          if (__nodes_to_add + 1 > this->_M_impl._M_map_size
  2194              - (this->_M_impl._M_finish._M_node - this->_M_impl._M_map))
  2195            _M_reallocate_map(__nodes_to_add, false);

I'd say there is something to fix in Cudd inside brial.

@strogdon
Copy link

comment:12

I should have posted this elsewhere.

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:13

Brial testsuite passes for me, too (gcc 9.1.1 / Fedora 30)

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:14

I think the culprit is in a higher stack frame, presumably std::deque runs out of memory because an invalid size is passed in. In particular, this looks fishy:

#10 pbori_cuddInitCache (unique=unique@entry=0x7f8185b293f0, 
    cacheSize=cacheSize@entry=262144, maxCacheSize=699050, 
    maxCacheSize@entry=<error reading variable: DW_OP_const_type has different sizes for type and data>) at cuddCache.c:152
        i = <optimized out>
        logSize = 18
        mem = <optimized out>
        offset = <optimized out>
        __PRETTY_FUNCTION__ = "pbori_cuddInitCache"

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:15

There is quite a fat memory leak, e.g.

sage: for i in range(300):
....:     B = BooleanPolynomialRing(5,'x{}'.format(i))

eats 5GB. Changing the memory limit, e.g.

./sage -t --memlimit=2000 src/sage/rings/polynomial/pbori.pyx

changes where the error occurs

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:16

Disabling the memlimit lets me complete

sage -t --memlimit=0 src/sage/rings/polynomial/pbori.pyx

but does eat much more memory than before. Doing it with the other tests eats all available memory

@dimpase
Copy link
Member

dimpase commented May 19, 2019

comment:17

Did you try going from -std=gnu++11 to -std=c++11 ? This can be done by
applying

--- a/build/pkgs/gcc/spkg-configure.m4
+++ b/build/pkgs/gcc/spkg-configure.m4
@@ -94,7 +94,7 @@ SAGE_SPKG_CONFIGURE([gcc], [
         IS_REALLY_GCC=yes
     fi
 
-    AX_CXX_COMPILE_STDCXX_11([], optional)
+    AX_CXX_COMPILE_STDCXX_11([noext], optional)
     if test $HAVE_CXX11 != 1; then
         SAGE_MUST_INSTALL_GCC([your C++ compiler does not support C++11])
     fi

@dimpase
Copy link
Member

dimpase commented May 19, 2019

comment:18

as well, our boost (and pbori uses it) is 2.5 y.o. version 1.66, while the current one is 1.70.

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:19

Brial's reference counting implementation miscompiles with gcc9, and the branch where the refcount falls back to zero is never executed. Hence the ram usage ballooning. I've opened an upstream bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90535

@dimpase
Copy link
Member

dimpase commented May 19, 2019

comment:20

it is insane that this kind of address arithmetic is invalid C, as they imply on the ticket. Where is it said in the standard?

@kiwifb
Copy link
Member

kiwifb commented May 19, 2019

comment:21

I don't think they imply this is invalid C. But this is fairly rare occurrence and, if I am reading correctly, accounting for it everywhere means some optimisation opportunities are lost. As you'll note, Volker said the example works at -O0. This is from GNU gcc devs from whom I learned that pessimized was a word (at least they use it).

I should be able to do a micro release of brial to include that option this week. PR welcome if you feel faster. I may also check if I can safely update the cudd code in brial to a more recent version. It may be that newer cudd doesn't exhibit the problem.

@Volker in which brial file does the incriminated code occur?

@dimpase
Copy link
Member

dimpase commented May 19, 2019

comment:22

OK, if they willingly mis-compile valid code for the sake of optimisation, they should at least print fat warnings about code branches they discard...

@strogdon
Copy link

comment:23

Replying to @kiwifb:

...

I should be able to do a micro release of brial to include that option this week. PR welcome if you feel faster. I may also check if I can safely update the cudd code in brial to a more recent version. It may be that newer cudd doesn't exhibit the problem.
...

Perhaps I misunderstand but my s-o-g brial build uses -O0; in fact, the system is built that way. And the tests still fail.

@kiwifb
Copy link
Member

kiwifb commented May 19, 2019

comment:24

OK, Volker's example works at -O0 may be our real life code is a bit more complicated. Could you compile brial with -fno-delete-null-pointer-checks in the CFLAGS (and CXXFLAGS too for good measure).

@strogdon
Copy link

comment:25

Replying to @kiwifb:

OK, Volker's example works at -O0 may be our real life code is a bit more complicated. Could you compile brial with -fno-delete-null-pointer-checks in the CFLAGS (and CXXFLAGS too for good measure).

I still get the 3 failures, brial built with

CFLAGS="-march=native -O0 -fno-delete-null-pointer-checks -pipe -g -ggdb"
CXXFLAGS="${CFLAGS}"

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:26

Reading more into this, C++11 Standard 5.7.5 says that pointer arithmetic on a pointer that is not pointing into an element of an array is undefined behavior. So in particular the compiler is free to assume that --ptr cannot be NULL, since you can't have allocated an array that starts at 0x1.

The offending code is

/// Release current pointer by decrementing reference counting
inline void 
intrusive_ptr_release(PBORI_PREFIX(DdManager)* ptr) {
   if (!(--(ptr->hooks))) {
 
    PBORI_ASSERT(PBORI_PREFIX(Cudd_CheckZeroRef)(ptr) == 0);
    PBORI_PREFIX(Cudd_Quit)(ptr);
  }
}

Really we should just change DdManager.hooks to be an std::size_t instead of char*, why on earth are you using pointer arithmetic to count in the first place. I think thats going to be a simple change since its only used in one or two places.

These are inlined headers so you also need to rebuild the sage library....

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:27

Indeed, this succeeds for me:

export CFLAGS='-fno-delete-null-pointer-checks'
export CXXFLAGS='-fno-delete-null-pointer-checks'
./sage -p brial
touch src/sage/rings/polynomial/pbori.pyx 
./sage -b
./sage -t --long src/sage/rings/polynomial/pbori.pyx
./sage -t --long src/sage/crypto/mq/sr.py
./sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py

@kiwifb
Copy link
Member

kiwifb commented May 19, 2019

comment:28

OK I thought it was in the C code from Cudd. But this is the C++ interface to Cudd from brial. That code is in libbrial/include/polybori/rings/CCuddInterface.h. There is some more suspicious code in libbrial/include/polybori/rings/CCuddCore.h

/// Release this by decrementing reference counting
  refcount_type release() {
    return (--ref);
  }

@vbraun
Copy link
Member

vbraun commented May 19, 2019

comment:29

refcount_type is just a typedef for std::size_t so thats perfect.

Since we don't want to change the DdManager struct I guess we have to allocate a value for DdManager.hooks.

@kiwifb
Copy link
Member

kiwifb commented May 19, 2019

comment:30

If you send a pull request I can make a quick release of brial.

@strogdon
Copy link

comment:31

Replying to @vbraun:

Indeed, this succeeds for me:

export CFLAGS='-fno-delete-null-pointer-checks'
export CXXFLAGS='-fno-delete-null-pointer-checks'
./sage -p brial
touch src/sage/rings/polynomial/pbori.pyx 
./sage -b
./sage -t --long src/sage/rings/polynomial/pbori.pyx
./sage -t --long src/sage/crypto/mq/sr.py
./sage -t --long src/sage/rings/polynomial/multi_polynomial_sequence.py

OK, rebuilding Sage also with the new CFLAGS does work here. I never thought about the headers.

@dimpase
Copy link
Member

dimpase commented May 20, 2019

comment:32

By the way, perhaps it's time to drop gcc 4. Indeed, 4.9.4 was the last 4.* release, and it happened in 2016.

@kiwifb
Copy link
Member

kiwifb commented May 21, 2019

comment:33

I am now using the following if gcc-9.1 is present in sage-on-gentoo

diff --git a/sage/libs/polybori/decl.pxd b/sage/libs/polybori/decl.pxd
index dd6a3aa..edb7bb7 100644
--- a/sage/libs/polybori/decl.pxd
+++ b/sage/libs/polybori/decl.pxd
@@ -1,5 +1,5 @@
 # distutils: language = c++
-# distutils: extra_compile_args = -std=c++11
+# distutils: extra_compile_args = -std=c++11 -fno-delete-null-pointer-checks
 
 from libcpp.string cimport string as std_string
 from libcpp.vector cimport vector

It is possibly over-broad.

@vbraun
Copy link
Member

vbraun commented Jun 5, 2019

comment:34

PR at BRiAl/BRiAl#35

@kiwifb
Copy link
Member

kiwifb commented Jun 6, 2019

comment:35

New brial release https://github.com/BRiAl/BRiAl/releases/download/1.2.5/brial-1.2.5.tar.bz2 with the PR included. Do we create a separate ticket for the upgrade?

@vbraun
Copy link
Member

vbraun commented Jun 6, 2019

comment:36

Please do make a ticket for that!

@kiwifb
Copy link
Member

kiwifb commented Jun 6, 2019

comment:37

#27942 is ready for review for brial.

@embray
Copy link
Contributor

embray commented Jun 14, 2019

comment:38

As the Sage-8.8 release milestone is pending, we should delete the sage-8.8 milestone for tickets that are not actively being worked on or that still require significant work to move forward. If you feel that this ticket should be included in the next Sage release at the soonest please set its milestone to the next release milestone (sage-8.9).

@embray embray removed this from the sage-8.8 milestone Jun 14, 2019
@kiwifb
Copy link
Member

kiwifb commented Aug 8, 2019

comment:39

I think we pretty much fixed this by opening separate tickets for each separate problems. So now, this meta ticket should be closed.

@strogdon
Copy link

Reviewer: Steven Trogdon

@strogdon
Copy link

comment:40

I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants