Segfault related to scheme_intern_exact_symbol #2882

Open
elfprince13 opened this issue Oct 30, 2019 · 31 comments

@elfprince13
Contributor

I'm trying to debug what I think is an unrelated issue, and Racket 7.4 is doing something naughty with memory that makes gdb and valgrind give up before I get to the bug I want to be looking at. Strangely, it appears to run fine when not hooked up to a debugging tool, so I don't know if that means it's just catching its own segfault and moving on, or what.

The only thing recognizable in the stack trace from valgrind is scheme_intern_exact_symbol.

[thomas@some-machine] builddir $ valgrind /deps/racket/racket/bin/racket
==24056== Memcheck, a memory error detector
==24056== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24056== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==24056== Command: /deps/racket/racket/bin/racket
==24056==
Welcome to Racket v7.4.
==24056== Invalid write of size 8
==24056==    at 0x51DB7CE: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51DC759: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51DC7D0: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51DC7FF: scheme_intern_exact_symbol (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x516836C: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51685B8: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51685B8: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x5165DF8: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x5165F9D: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51660BD: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x516568C: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x51666C4: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==  Address 0xd38 is not stack'd, malloc'd or (recently) free'd
==24056==
SIGSEGV MAPERR si_code 1 fault on addr 0xd38
==24056==
==24056== Process terminating with default action of signal 6 (SIGABRT)
==24056==    at 0x561CE97: raise (raise.c:51)
==24056==    by 0x561E800: abort (abort.c:79)
==24056==    by 0x523E16D: fault_handler (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==    by 0x561CF1F: ??? (in /lib/x86_64-linux-gnu/libc-2.27.so)
==24056==    by 0x51DB7CD: ??? (in /deps/racket/racket/lib/libracket3m-7.4.so)
==24056==
==24056== HEAP SUMMARY:
==24056==     in use at exit: 8,675,309 bytes in 1,958 blocks
==24056==   total heap usage: 1,975 allocs, 17 frees, 8,693,826 bytes allocated
==24056==
==24056== LEAK SUMMARY:
==24056==    definitely lost: 81,920 bytes in 1 blocks
==24056==    indirectly lost: 0 bytes in 0 blocks
==24056==      possibly lost: 288 bytes in 1 blocks
==24056==    still reachable: 8,593,101 bytes in 1,956 blocks
==24056==         suppressed: 0 bytes in 0 blocks
==24056== Rerun with --leak-check=full to see details of leaked memory
==24056==
==24056== For counts of detected and suppressed errors, rerun with: -v
==24056== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Aborted

If it's helpful, system specs are included:

[thomas@some-machine] builddir $ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.2 LTS
Release:	18.04
Codename:	bionic
[thomas@some-machine] builddir $ gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[thomas@some-machine] builddir $ valgrind --version
valgrind-3.13.0
@samth
Sponsor Member

samth commented Oct 31, 2019

This is likely to be the use of the SIGSEGV handler for the GC write barrier. When starting GDB on Racket, you should do handle SIGSEGV nostop noprint.
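
For readers unfamiliar with the technique, here is a minimal sketch of a signal-based write barrier on Linux. It is not Racket's implementation; `page`, `dirty`, `barrier_handler`, and `write_barrier_demo` are names invented for this illustration. A page is made read-only, the first store to it raises SIGSEGV, and the handler records the page as dirty, unprotects it, and returns so that the store is retried. A debugger sees that first fault as an ordinary segfault, which is presumably why gdb stops unless it is told to pass SIGSEGV through.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;                       /* stands in for an old-generation page */
static size_t page_size;
static volatile sig_atomic_t dirty = 0;  /* set when the barrier fires */

/* On a fault inside the protected page: record it, unprotect the page, and
   return so the faulting store is retried. Any other fault is a real crash. */
static void barrier_handler(int sig, siginfo_t *info, void *context) {
    char *addr = (char *)info->si_addr;
    (void)sig; (void)context;
    if (addr >= page && addr < page + page_size) {
        dirty = 1;
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        return;
    }
    abort();
}

/* Returns 1 if a store to the protected page was intercepted and then
   completed, 0 if not, and -1 if setup failed. */
int write_barrier_demo(void) {
    struct sigaction sa, old;
    int ok;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = barrier_handler;
    sa.sa_flags = SA_SIGINFO;            /* needed to receive the fault address */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) return -1;

    if (mprotect(page, page_size, PROT_READ) != 0) return -1;  /* arm the barrier */
    *(volatile char *)page = 42;         /* faults once, then succeeds on retry */
    ok = dirty && *(volatile char *)page == 42;

    sigaction(SIGSEGV, &old, NULL);      /* put the previous handler back */
    munmap(page, page_size);
    return ok;
}
```

Calling `write_barrier_demo()` from a small `main` under gdb should show the debugger stopping on the store unless SIGSEGV has been set to nostop.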

@elfprince13
Contributor Author

This is likely to be the use of the SIGSEGV handler for the GC write barrier.

Is there some good documentation on the GC write barrier I could look at? It would be nice to have a clearer idea of what's happening under the hood.

When starting GDB on Racket, you should do handle SIGSEGV nostop noprint.

Thanks, hopefully this won't mask too many errors in my own code.

@elfprince13
Contributor Author

Actually - is it possible that Racket's SIGSEGV handler is masking a SIGSEGV elsewhere in a program that uses embedded Racket, and this is causing my program to hang instead of crash and generate a backtrace?

@samth
Sponsor Member

samth commented Oct 31, 2019

@elfprince13 that shouldn't happen -- I've had plenty of segfaults that I've debugged successfully.

I don't think there's any regular Racket documentation of the signal handling behavior, but the GC approach in traditional Racket is described in these two papers: https://www.cs.utah.edu/plt/publications/ismm04-wf.pdf https://www.cs.utah.edu/plt/publications/ismm09-rwrf.pdf

@elfprince13
Contributor Author

Thanks, I'll take a look =) I think something funky is going on, because I added some code to swap out the Racket segfault handler while my interpreter coroutine is suspended, and now it crashes instead of hanging, but earlier. It just occurred to me that if the GC is running in a separate thread, and kept chugging along while the interpreter was suspended, that it probably still needed its own handler in place.
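
A minimal sketch of what swapping a SIGSEGV handler out and back in can look like with `sigaction`, assuming plain POSIX signals. The two handlers and the function names are hypothetical stand-ins and are not taken from Racket or from the embedding code discussed here. The point is that `sigaction` hands back the previous disposition, so it can be restored exactly before the interpreter resumes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes sigaction under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-ins: one for the embedded runtime's fault handler,
   one for the host application's own crash handler. */
static void runtime_handler(int sig) { (void)sig; }
static void host_handler(int sig) { (void)sig; }

/* Install `handler` for SIGSEGV; if `saved` is non-NULL, store the previous
   disposition there so it can be restored later. Returns 0 on success. */
static int install_segv_handler(void (*handler)(int), struct sigaction *saved) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, saved);
}

/* Returns 1 if the runtime's handler is back in place after the round trip,
   0 if it is not, and -1 if any sigaction call failed. */
int handler_swap_demo(void) {
    struct sigaction original, runtime_saved, current;
    int ok;

    /* The runtime starts up and installs its handler. */
    if (install_segv_handler(runtime_handler, &original) != 0) return -1;
    /* The coroutine is suspended: the host installs its own handler and
       remembers the runtime's. */
    if (install_segv_handler(host_handler, &runtime_saved) != 0) return -1;
    /* ... host code would run here with its own handler ... */
    /* Before resuming the coroutine: restore the runtime's handler. */
    if (sigaction(SIGSEGV, &runtime_saved, NULL) != 0) return -1;

    /* Passing NULL as the new action only queries the current disposition. */
    if (sigaction(SIGSEGV, NULL, &current) != 0) return -1;
    ok = (current.sa_handler == runtime_handler);
    assert(runtime_saved.sa_handler == runtime_handler);

    sigaction(SIGSEGV, &original, NULL);   /* leave the process as we found it */
    return ok;
}
```

If the runtime's handler is needed while the host handler is installed, for example because a protected page is written during that window, the fault reaches the wrong handler, so this kind of swap is only safe when no runtime memory is touched in between.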

@samth
Sponsor Member

samth commented Oct 31, 2019

The GC doesn't run in a different thread. You could try disabling GC and see if that helps.

@mflatt
Member

mflatt commented Oct 31, 2019

You can configure with --disable-generations to avoid a signal-based write barrier.

@elfprince13
Contributor Author

Okay, so, --disable-generations made my embedded code run without any problems at all.

I'm happy to just do that for now, although it would be nice to figure out why, and see if I can develop a work-around. I can't post any source code, but roughly, I have a C++ coroutine implementation based on Boost::Context, that puts the Racket interpreter on its own stack, and lets me yield in and out of a file-loading loop for my EDSL. Any thoughts on why that would interact negatively with the generational collector would be much appreciated.

One specific question: is the default setting of --enable-generations vs --disable-generations platform specific? My code has been working fine on macOS Mojave, and breaking on Ubuntu, and I had no idea why, so I'm wondering if --disable-generations was set by default on macOS.

@samth
Sponsor Member

samth commented Oct 31, 2019

You'll have to wait for @mflatt to provide helpful advice, but --enable-generations is the default on all platforms.

@mflatt
Member

mflatt commented Oct 31, 2019

Mac OS is different because the fault is caught at the Mach layer instead of the BSD-impersonation layer.

My only guess about why coroutines would interfere is that stack use by a signal handler might go wrong somehow.
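
To make the stack question concrete: by default a signal handler runs on whatever stack the thread is using when the signal arrives, which for a coroutine is the coroutine's own (possibly small) stack. POSIX lets a process register an alternate stack with `sigaltstack` and request it with `SA_ONSTACK`. The sketch below only demonstrates that mechanism; whether Racket's handler uses it is not stated in this thread, and all names in the code are invented for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes sigaltstack under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)   /* arbitrary size chosen for the demo */

static char *alt_stack;
static volatile sig_atomic_t ran_on_alt_stack = 0;

/* Checks whether one of its own locals lives inside the alternate stack. */
static void on_signal(int sig) {
    char local;
    uintptr_t p = (uintptr_t)&local;
    uintptr_t lo = (uintptr_t)alt_stack;
    (void)sig;
    if (p >= lo && p < lo + ALT_STACK_SIZE) ran_on_alt_stack = 1;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if it did not,
   and -1 if setup failed. */
int alt_stack_demo(void) {
    stack_t ss;
    struct sigaction sa;

    alt_stack = malloc(ALT_STACK_SIZE);
    if (alt_stack == NULL) return -1;
    assert(ALT_STACK_SIZE >= MINSIGSTKSZ);

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_ONSTACK;        /* run this handler on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0) return -1;

    raise(SIGUSR1);                  /* handler completes before raise returns */
    return ran_on_alt_stack;
}
```

The alternate stack is registered per thread, not per coroutine, so code that switches stacks by hand still shares a single signal stack within a thread.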

@elfprince13
Contributor Author

Mac OS is different because the fault is caught at the Mach layer instead of the BSD-impersonation layer.

Interesting, thanks!

Okay, so, --disable-generations made my embedded code run without any problems at all.

Apparently this was premature. Whether or not my code works now depends on whether or not Racket is compiled with optimization, and my previous test with --disable-generations was also using CFLAGS=-Og -ggdb. Recompiling with the default CFLAGS causes it to break again. I suspect this means there's some undefined behavior in Racket's C source.

When Racket is compiled with the default flags, valgrind reports a number of errors related to uninitialized values in scheme_native_stack_trace() (see attached log excerpt: valgrind_excerpt.txt). Unfortunately GCC doesn't report warnings for uninitialized values when compiling with higher levels of optimization, but if I switch back to -Og, I see:

./jitstack.c:171:33: warning: variable 'set_cache_sp' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./jitstack.c:173:25: warning: variable 'last' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./jitstack.c:178:7: warning: variable 'manual_unw' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./jitstack.c:183:7: warning: variable 'shift_cache_to_next' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./jitstack.c:185:7: warning: variable 'unsuccess' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./../src/jitalloc.c:45:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
./../src/jitcommon.c:1469:3609: warning: this statement may fall through [-Wimplicit-fallthrough=]
./../src/jitcommon.c:2191:2671: warning: 'prim_other_type' may be used uninitialized in this function [-Wmaybe-uninitialized]
./../src/jitstack.c:178:7: warning: variable 'manual_unw' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./../src/jitstack.c:183:7: warning: variable 'shift_cache_to_next' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./../src/jitstack.c:185:7: warning: variable 'unsuccess' might be clobbered by 'longjmp' or 'vfork' [-Wclobbered]
./../src/jitstate.c:17:75: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
./../src/jitstate.c:21:81: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]

I'm going to try compiling with --disable-jit to see if that fixes things on my side.
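
As background on the -Wclobbered warnings above: the C standard says that a non-volatile local variable modified between `setjmp` and `longjmp` has an indeterminate value after the jump. The sketch below shows the pattern and the usual remedy of declaring the variable `volatile`. It is an illustration only; the function and variable names are invented, and nothing here establishes that the variables flagged in jitstack.c are actually misused.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf env;

/* Non-local exit back to the setjmp call in the function below. */
static void bail_out(void) { longjmp(env, 1); }

/* Returns the last step recorded before the non-local exit. */
int count_steps_before_longjmp(void) {
    /* volatile because the variable is written after setjmp and read after
       longjmp; without it the value read below would be indeterminate and
       GCC may warn with -Wclobbered. */
    volatile int steps = 0;

    if (setjmp(env) == 0) {
        steps = 1;
        steps = 2;
        bail_out();          /* jumps back; setjmp then returns 1 */
        steps = 3;           /* never reached */
    }
    return steps;
}
```

With `volatile` in place the function returns 2. Dropping the qualifier makes the result depend on the optimization level, which matches the general shape of a problem that shows up at -O2 or -O3 but not at -Og.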

@pmatos
Collaborator

pmatos commented Nov 6, 2019

@elfprince13 what's the status of this bug? Have you managed to get to some kind of conclusion? Also, is there any way you can provide a reproducible example for us to try on our end?

@elfprince13
Contributor Author

@pmatos - Thanks for checking in.

The title should probably be something different now; however, my code now works correctly when Racket is compiled with --disable-jit --disable-generations, or when Racket is compiled with CFLAGS="-Og -ggdb", which makes me believe that there is some undefined behavior in the Racket code which is being interpreted the wrong way by malevolent spirits (that is, gcc's optimizer).

The valgrind log + gcc warnings in my previous comment suggest a starting place for tracking it down; however, I'll see if I can extract a reproducible example.

@pmatos
Collaborator

pmatos commented Nov 7, 2019

@elfprince13 If you can get the example, great. If not, can you please compile with --enable-ubsan and cause the crash? Let me know if the undefined behaviour sanitizer catches anything - it should print something to stderr.

@elfprince13
Contributor Author

Annoyingly, I'm having difficulty reproducing the specific conditions that were causing our problem (make clean in racket-7.4/src reports a number of errors that I was ignoring, which makes me wonder if I'm getting nondeterministic results); however, in an attempt to further provoke the malevolent spirits (gcc's optimizer), compiling with:

CFLAGS="-O3" ./configure --enable-shared --enable-jit --enable-generations

results in racket/racket3m hanging during the make install process.

CFLAGS="-O3" ./configure --enable-shared --enable-jit --enable-generations --enable-ubsan

results in racket/racket3m terminating in a bus error during the make install process.

@elfprince13
Contributor Author

(note: if I rerun make install several times, the --enable-ubsan version sometimes bus errors, sometimes segfaults, and sometimes runs to completion)

@pmatos
Collaborator

pmatos commented Nov 7, 2019

@elfprince13 is it possible you have a hardware problem? Can you try running a disk checker and memory test to rule those issues out?

I have seen this kind of nondeterminism occur with hardware issues. Also, I compile this config on Linux on a regular basis without problems.

Alternatively, can you reproduce the problem on a different pc?

@elfprince13
Contributor Author

elfprince13 commented Nov 7, 2019

Was able to replicate the make install bus error described above, using two AWS virtual machines created from the Ubuntu 18.04 AMI in two different availability zones (to ensure they weren't somehow getting created on the same faulty machine), both with a t2.large instance.

After booting:

sudo apt-get update
sudo apt install gcc make
mkdir usr
wget "https://mirror.racket-lang.org/installers/7.4/racket-minimal-7.4-src-builtpkgs.tgz"
tar xf racket-minimal-7.4-src-builtpkgs.tgz
pushd racket-7.4/src
CFLAGS="-O3 -Wextra -Wno-unused-parameter" ./configure --enable-shared --enable-jit --enable-generations --enable-ubsan --prefix=/home/ubuntu/usr
make
make install

The installed gcc version is still 7.4.0.

Final output from make install is

racket/racket3m -X "/home/ubuntu/usr/share/racket/collects" -G "/home/ubuntu/usr/etc/racket"    --no-user-path -N "raco" -l- setup --no-user
raco setup: version: 7.4
raco setup: platform: x86_64-linux [3m]
raco setup: target machine: racket
raco setup: installation name: 7.4
raco setup: variants: 3m
raco setup: main collects: /home/ubuntu/usr/share/racket/collects
raco setup: collects paths:
raco setup:   /home/ubuntu/usr/share/racket/collects
raco setup: main pkgs: /home/ubuntu/usr/share/racket/pkgs
raco setup: pkgs paths:
raco setup:   /home/ubuntu/usr/share/racket/pkgs
raco setup:   /home/ubuntu/.racket/7.4/pkgs
raco setup: links files:
raco setup:   /home/ubuntu/usr/share/racket/links.rktd
raco setup: main docs: /home/ubuntu/usr/share/doc/racket
raco setup: --- updating info-domain tables ---                    [22:58:17]
raco setup: --- pre-installing collections ---                     [22:58:17]
raco setup: --- installing foreign libraries ---                   [22:58:17]
raco setup: --- installing shared files ---                        [22:58:17]
raco setup: --- compiling collections ---                          [22:58:17]
raco setup: making: <collects>/racket
Bus error (core dumped)
Makefile:197: recipe for target 'install-3m' failed
make[1]: *** [install-3m] Error 135
make[1]: Leaving directory '/home/ubuntu/racket-7.4/src'
Makefile:119: recipe for target 'install' failed
make: *** [install] Error 2

@pmatos
Collaborator

pmatos commented Nov 8, 2019

@elfprince13 thank you very much. I can repro locally in a docker container. I will take a look at this and see how far I get.

@elfprince13
Contributor Author

Hey @pmatos, I just wanted to check if you'd had a chance to look into this yet.

@pmatos
Collaborator

pmatos commented Jan 10, 2020

Hey @pmatos, I just wanted to check if you'd had a chance to look into this yet.

Sorry @elfprince13, dropped the ball on this one. Let me get back to you in a few hours.

@pmatos
Collaborator

pmatos commented Jan 10, 2020

@elfprince13 I haven't reached any conclusion but here are some findings that might prove useful if you need a workaround.

  • It builds fine with -O2
  • It builds fine with gcc 8.3.0
  • It is dependent on a combination of flags to gcc that are included in -O3 but not -O2. I am trying to understand which flags exactly but the smallest command line I found that fails with gcc 7.4.0 is (run from racket/racket/src, not top-level make):
$ CFLAGS="-O2 -finline-functions -fgcse-after-reload -ftree-loop-vectorize -ftree-loop-distribute-patterns -fsplit-paths -fipa-cp-clone" ./configure --enable-shared --enable-jit --enable-generations --enable-ubsan
$ make -j40
$ make -j40 install
  • --enable-jit and --enable-generations are the default, I just add them to be explicit.
  • --enable-ubsan is not necessary but forces a crash, otherwise you will just get a hang.
  • If you build with --enable-racket=<recent snapshot>, it seems to work fine.

I have a few theories - I will keep looking at this. In the meantime, I am also bisecting gcc, in order to understand at which point gcc started compiling racket just fine.

@pmatos
Collaborator

pmatos commented Jan 13, 2020

Further comments on this:

Because this comes down to a change in a default parameter value between gcc versions, I confirmed it by setting the parameter explicitly with both compilers.
In particular, the relevant parameter here is inline-min-speedup. So:

COMPILES RACKET

$ CC=gcc-7 ./configure CFLAGS="-O3 --param inline-min-speedup=15" --enable-shared --enable-jit --enable-generations && make -j50
$ CC=gcc-8 ./configure CFLAGS="-O3 --param inline-min-speedup=15" --enable-shared --enable-jit --enable-generations && make -j50

Default value of inline-min-speedup for gcc 8.3.0 is 15.

BREAKS

$ CC=gcc-8 ./configure CFLAGS="-O3 --param inline-min-speedup=8" --enable-shared --enable-jit --enable-generations && make -j50
$ CC=gcc-7 ./configure CFLAGS="-O3 --param inline-min-speedup=8" --enable-shared --enable-jit --enable-generations && make -j50

Default value of gcc 7.4.0 for inline-min-speedup is 8.

This is a bug in gcc's inliner - not racket. I will keep this open until I open a bug on the gcc side, but I need to ensure it's not yet fixed in gcc's tip of master.

@pmatos
Collaborator

pmatos commented Jan 13, 2020

I also meant to say that the gcc change was not a fix. Increasing the inlining requirements meant that the inliner didn't inline the case where it breaks on gcc 8 and therefore hid the bug revealed by gcc 7.

You can see this by setting inline-min-speedup to 8 in gcc8 and later.

Since this is an inliner bug, you can also compile racket flawlessly by using -fno-inline. Obviously it's not great but shows where the problem is.
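
A narrower alternative to a global -fno-inline, sketched here under assumptions, is to mark individual suspect functions with GCC's `noinline` attribute and rebuild, which can help bisect which inlining decision matters. The function names below are hypothetical and do not come from the Racket source; only the attribute itself is a real GCC feature.

```c
#include <assert.h>

/* Fall back to nothing on compilers that lack the GNU attribute syntax. */
#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

/* Hypothetical stand-in for a function suspected of being miscompiled when
   it is inlined into its callers. */
NOINLINE static int suspect_helper(int x) {
    return x * 2 + 1;
}

/* A caller that GCC would otherwise be free to inline suspect_helper into. */
int sum_of_helpers(int x) {
    return suspect_helper(x) + suspect_helper(x + 1);
}
```

The attribute does not change what the code computes (`sum_of_helpers(3)` is 16 either way); it only removes one function from the inliner's candidates, so a build that starts working after adding it points at that function or its callers.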

@mflatt
Member

mflatt commented Jan 13, 2020

Sounds like there's still the possibility that the Racket source does something unspecified, where inlining reveals that enough that the compiler takes advantage of the lack of specification.

@elfprince13
Contributor Author

Yeah, I'm curious if we can make gcc log the inlining decisions it made (using, e.g. -fopt-info-inline-optimized-missed=inline.txt) and diff those logs to narrow the region of the codebase that's being optimized in a problematic way and see if there's something fishy going on in the code.

@pmatos
Collaborator

pmatos commented Jan 13, 2020 via email

@pmatos
Collaborator

pmatos commented Jan 13, 2020 via email

@pmatos
Collaborator

pmatos commented Jan 14, 2020

GCC head is working so either the defaults changed between 8.3.0 and now, or a fix went in. I will have to bisect.

@pmatos
Collaborator

pmatos commented Jan 14, 2020

... or, of course, there is an undefined behaviour in racket and the way gcc handles it changed so that it now works. :)

Given I have seen no undefined behaviour reported by ubsan, I prefer to think that's not the case.

@pmatos
Collaborator

pmatos commented Jan 15, 2020

GCC head is working so either the defaults changed between 8.3.0 and now, or a fix went in. I will have to bisect.

Forgot to mention - GCC head (e2346a33b) is not working after all. I simply made a mistake on the command line when testing. I am looking for the culprit spot where the inliner fails.
