Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfaults #14

Closed
blakejohnson opened this issue May 17, 2013 · 93 comments
Closed

Segfaults #14

blakejohnson opened this issue May 17, 2013 · 93 comments

Comments

@blakejohnson
Copy link
Contributor

I keep running into an issue where running @sprofile on the same function several times causes Julia to segfault. I am not even sure where to start in finding the origin of this issue. I am running Mac OS 10.7.5. I rebuild Julia daily, and this problem has existed at least for a few weeks. I also rebuilt the Profile.jl library with Pkg.runbuildscript("Profile").

If it would help to post the code I am running, I would be happy to do so.

@timholy
Copy link
Owner

timholy commented May 17, 2013

Thanks for the report. I've never seen this, so if you really have no objections it would be useful for me to try your code. It sounds like you've tried the obvious things. Did it used to work without a segfault, or is it simply that you've only been using @sprofile in this fashion recently?

This might be an OSX vs Linux thing, of course. Since I was writing it on Linux I went with its most modern timer API, then later learned that it's not available on BSD. But Linux can also use the same older timer API as OSX. So if I can't reproduce the segfault I'll compile a special version and test the OSX timers. I did that when I first developed the OSX support, but I bet I didn't give it the same kind of thorough workout you're giving it.

@blakejohnson
Copy link
Contributor Author

Sure, the code is here: https://gist.github.com/blakejohnson/5600044

Running this causes my machine to segfault:

require("SimSpeedTest.jl")
require("Profile")
using SProfile
@sprofile fot ct=1:10 run_sim(); end

@timholy
Copy link
Owner

timholy commented May 17, 2013

I'll definitely have to try the BSD timers. I was able to run that last line ~6 times without issue.

Hopefully this weekend.

@blakejohnson
Copy link
Contributor Author

I'll also have a look on another computer to see if it is something
peculiar to my work machine.

--Blake

On Fri, May 17, 2013 at 3:37 PM, Tim Holy notifications@github.com wrote:

I'll definitely have to try the BSD timers. I was able to run that last
line ~6 times without issue.

Hopefully this weekend.


Reply to this email directly or view it on GitHubhttps://github.com//issues/14#issuecomment-18081921
.

@blakejohnson
Copy link
Contributor Author

FWIW, here is an example backtrace upon segfault:

#0  0x00007fff8916a2b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8916a1fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8916a171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097a66 in rec_backtrace ()
#4  0x00000001018f2d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00000001048259d1 in zhemv_ ()
#7  0x00000001050357a7 in zhetd2_ ()

@blakejohnson
Copy link
Contributor Author

Ok, I can confirm the issue on another Mac, this one with OS 10.8.2. The segfault backtrace is basically the same.

@timholy
Copy link
Owner

timholy commented May 18, 2013

OK, I commented out the Linux timers and rebuilt. I still don't get the segfault. Then I ran it under valgrind as
valgrind --track-origins=yes /home/tim/src/julia/usr/bin/julia-debug-readline.

I got these errors:
==1537== Thread 2:
==1537== Syscall param msync(start) points to unaddressable byte(s)
==1537== at 0x50684CD: ??? (syscall-template.S:82)
==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67A2C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67CEA: _ULx86_64_access_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E675AA: _ULx86_64_get_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6C9FB: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x53A5987: rec_backtrace (task.c:506)
==1537== Address 0xd99d000 is not stack'd, malloc'd or (recently) free'd
==1537==
==1537== Syscall param msync(start) points to uninitialised byte(s)
==1537== at 0x50684CD: ??? (syscall-template.S:82)
==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6AB8C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6CC5A: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x53A5987: rec_backtrace (task.c:506)
==1537== by 0xE3FE8B7: sprofile_bt (profile.c:27)
==1537== by 0x646E49F: ??? (in /lib/x86_64-linux-gnu/libc-2.15.so)
==1537== Address 0xd99e000 is on thread 2's stack
==1537== Uninitialised value was created by a stack allocation
==1537== at 0x53A58E1: rec_backtrace (task.c:499)
==1537==

See also:
JuliaLang/julia#782

I don't know what to do here. Looks like it could be a bug in libunwind. CCing @JeffBezanson.

@blakejohnson
Copy link
Contributor Author

I also get the feeling that it is a libunwind bug.

@blakejohnson
Copy link
Contributor Author

On the Mac, valgrind crashes before it finishes loading the Julia REPL. So, I can't show similar output...

@timholy
Copy link
Owner

timholy commented May 21, 2013

Hmm, that's not good!

Sounds like one of us should submit a bug report to libunwind. Do you do C?

@blakejohnson
Copy link
Contributor Author

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Ugh.

@JeffBezanson
Copy link

I believe the reason libunwind is calling msync here is just to check its return code to see if a memory region is valid. So it is expected that it will sometimes pass invalid addresses to msync.

@timholy
Copy link
Owner

timholy commented May 21, 2013

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Dunno. I assume we get "ours" straight from libunwind? In that case, I'd
consider starting with libunwind and see where that gets us.

@timholy
Copy link
Owner

timholy commented May 21, 2013

@JeffBezanson, thanks for checking. Obviously it might be more informative to run it on the mac, if valgrind weren't crashing.

@blakejohnson
Copy link
Contributor Author

It turns out that we only build libunwind on linux and FreeBSD. So, this must be a bug in Apple's implementation. I think Apple still runs their radar bug tracker. So, I guess I will file there.

@timholy
Copy link
Owner

timholy commented May 27, 2013

Thanks for continuing to pursue this! Profile users will owe you their thanks.

@timholy
Copy link
Owner

timholy commented May 28, 2013

CC @ViralBShah.

@ViralBShah
Copy link

We can certainly build a libunwind on OS X, if there is a patched version that works.

@blakejohnson
Copy link
Contributor Author

So, I've been looking more into this over the last few days. It's difficult to debug because libunwind is very low-level. I've only really unearthed two new pieces of information:

  1. The crash point always has OpenBLAS on the stack.
  2. The few other similar reports I can find suggest that the problem could be using inconsistent versions of libstdc++. So, I wondered if gfortran is linking to a different libstdc than clang.

@timholy
Copy link
Owner

timholy commented Jun 4, 2013

That does sound like progress. Viral, you know a lot about Julia's build process, is this theory likely?

@ViralBShah
Copy link

I do not have any more insight here. Since gfortran is provided by the user, I can imagine that different builds link to different libraries. I personally have now moved towards using brew for all the dependencies to build julia. OpenBLAS and gfortran do not seem to be linking to anything weird though:

 otool -L libopenblas.dylib 
libopenblas.dylib:
    @rpath/libopenblas.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)

OpenBLAS does lots of tricks under the hood, and perhaps @xianyi can tell us if it is possible that it is likely to interfere with profiling.

Also, the last I had checked (probably a year ago, when we got backtrace), the upstream libunwind does not build on the mac, and you have to use Apple's. Perhaps that has changed now.

@ViralBShah
Copy link

I notice that libprofile is built on gcc with mac, instead of with clang. Could this be an issue?

@ViralBShah
Copy link

I built with clang and still get the segfault.

@blakejohnson
Copy link
Contributor Author

If you then do otool -L /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib you will notice that it links to '/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgcc_s.1.dylib'. Whereas libjulia-release.dylib links to /usr/lib/libstdc++.6.dylib. Is this a problem?

I'll check what this looks like on linux.

@blakejohnson
Copy link
Contributor Author

On Ubuntu 12.04:

ldd libopenblas.so
    linux-gate.so.1 =>  (0xb76f8000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb6a02000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb69e7000)
    libgfortran.so.3 => /usr/lib/i386-linux-gnu/libgfortran.so.3 (0xb68ee000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6744000)
    /lib/ld-linux.so.2 (0xb76f9000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb66cc000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66ae000)

ldd /usr/lib/i386-linux-gnu/libgfortran.so.3
    linux-gate.so.1 =>  (0xb76ed000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb7561000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7535000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7516000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb736c000)
    /lib/ld-linux.so.2 (0xb76ee000)

ldd libjulia-release.so 
    linux-gate.so.1 =>  (0xb77bf000)
    libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb681a000)
    librt.so.1 => /lib/i386-linux-gnu/librt.so.1 (0xb6811000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb67f5000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb670c000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb66e0000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66c2000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6518000)
    /lib/ld-linux.so.2 (0xb77c0000)

So, both libgfortran and libjulia link to libgcc_s.so.1. This is at least different from the mac.

@ViralBShah
Copy link

That doesn't seem to be a problem to me. The gfortran is not provided with XCode and hence it brings its own version of libgcc. libjulia needs libstdc++ which comes from the C toolchain. It seems that the linking and loading is happening correctly.

@blakejohnson
Copy link
Contributor Author

Alright, I guess it is time to look in a different direction. My test script spends most of its time in OpenBLAS, so perhaps it is just a coincidence that that I never seen it crash outside of OpenBLAS. I'll try some of the perf2 demos that are pure Julia and see if I can get the profiler to crash.

@blakejohnson
Copy link
Contributor Author

Well, I cannot get the laplace or Go benchmarks to segfault during profiling. So, perhaps OpenBLAS or ccall are important.

@ViralBShah
Copy link

@xianyi Any wisdom from you here will be useful. This is a blocker for having a sampling profiler as part of julia base, since it crashes when openblas is executing.

@xianyi
Copy link

xianyi commented Jun 6, 2013

Hi @blakejohnson , @ViralBShah ,

Could you try the latest develop branch? We fixed some bugs in level-2 BLAS.

Is it multi-threaded OpenBLAS? Could you try the single thread? Or, build OpenBLAS with USE_OPENMP=1

Xianyi

@ViralBShah
Copy link

Potential culprit: JuliaLang/julia#3365

Perhaps the fortran interface inside OpenBLAS needs to be compiled with -ff2c

@timholy
Copy link
Owner

timholy commented Jun 24, 2013

But the segfaults on OSX are not specific to OpenBLAS. On Linux, the "unrooted" backtraces are also independent of OpenBLAS.

@timholy
Copy link
Owner

timholy commented Jun 24, 2013

Blake, when you link against OpenBLAS, are all the backtraces truncated, or only some of them? For example, I'm guessing if the backtrace is triggered while in main it should not be truncated (capturing samples in main is quite rare, so you may need to run for more samples or run multiple times). Also, looking at your truncated backtraces, I notice there's something yet different from #19: the "head" of the truncated backtrace is not even within OpenBLAS, it's within the signal handler. Presumably unw_step is returning <=0 for the first frame beyond that.

Even though things are still rather confusing, I'm wondering if the time has come to report the issue with the libunwind developers. Perhaps they might immediately know what's happening and help direct our investigations. I'm not sure whether the mailing list or the bug reporter is the better choice. My suspicion is that the mailing list would be better, since I think there's still a lot we don't understand about this issue.

@timholy
Copy link
Owner

timholy commented Jun 24, 2013

Blake, if you also think that the time has come to report it, do you want to do it or should I? I'm happy either way.

@blakejohnson
Copy link
Contributor Author

If you wouldn't mind contacting the libunwind developers, that would be great. They may or may not be able to help given the disconnect between their code and Apple'. But, they can probably point us in productive directions.

I want to try to extend my example a little further to include calls to a simple external library. Then, I think this will be ready to file with Apple.

@blakejohnson
Copy link
Contributor Author

I should mention my null result from last night: I obtained a proper gcc (rather than use the llvm-gcc that Apple provides) and tried building my test with that to at least rule out some varieties of incompatible calling conventions between gfortran and LLVM. No change, regardless of optimization level or fno-omit-frame-pointer.

@blakejohnson
Copy link
Contributor Author

Now I am back to thinking that there is an issue with how OpenBLAS is compiled. I've updated my gist with two new targets:
https://gist.github.com/blakejohnson/5831456

The 4th target uses a simple C version of daxpy. I get full backtraces here.

The 5th target compiles NETLIB's daxpy with gfortran and calls that directly. This also gives me full backtraces.

So, OpenBLAS must be doing something different. @xianyi ?

@xianyi
Copy link

xianyi commented Jun 26, 2013

OpenBLAS uses the assembly kernel for daxpy.

@blakejohnson
Copy link
Contributor Author

I see... these assembly methods manually encode the ABI prologue and epilogue. Hmm. Maybe I should choose a method implemented in C.

@timholy
Copy link
Owner

timholy commented Jun 26, 2013

I was hoping that perhaps we were beginning to discover that assembly methods were lacking unwind information and causing your segfault. But gettimeofday seems to be C. So this might explain the truncated backtraces but not the segfaults.

@timholy
Copy link
Owner

timholy commented Jun 26, 2013

@blakejohnson, can you trigger the segfault from pure-Julia, using the unsafe_iceil example I posted in JuliaLang/julia#3469?

@blakejohnson
Copy link
Contributor Author

After spending a couple hours trying to learn about calling conventions in x86_64, it seems like the assembly methods in OpenBLAS always omit the frame pointer. We could try modifying the definitions of PROLOGUE and EPILOGUE in OpenBLAS/common_x86_64.h to include:

pushq %rbp
movq %rsp, %rbp

and

popq %rbp

@blakejohnson
Copy link
Contributor Author

@timholy I cannot seem to trigger a segfault with your unsafe_iceil example.

@blakejohnson
Copy link
Contributor Author

Looking into this a little further, we may be able to just add .cfi_startproc and .cfi_endproc to the PROLOGUE and EPILOGUE definitions. I'm building OpenBLAS with that change now.

@blakejohnson
Copy link
Contributor Author

Well, no joy on my simple modification of PROLOGUE and EPILOGUE.

@timholy
Copy link
Owner

timholy commented Jun 26, 2013

Good try. This is a toughie!

I think I'm going to try the whole _U_dyn_register thing, but it may be a few days before I can get to it (and perhaps more to finish it...).

@xianyi
Copy link

xianyi commented Jun 27, 2013

Hi @blakejohnson ,

When you added the following instruction.

pushq %rbp
movq %rsp, %rbp

Did you use -fno-omit-frame-pointer flag?

Xianyi

@blakejohnson
Copy link
Contributor Author

@xianyi I haven't tried that code snippet yet, just the .cfi_startproc and .cfi_endproc version. I was under the impression, though, that -fno-omit-frame-pointer would only affect code that wasn't already written in assembly. Is that incorrect?

@timholy
Copy link
Owner

timholy commented Jul 14, 2013

@blakejohnson, it's possible that turning on libunwind debugging output might help debug these segfaults. See the procedure in JuliaLang/julia#3469 (comment)

@blakejohnson
Copy link
Contributor Author

Thanks, @timholy. I'll see if Apple's libunwind supports similar debug flags.

@asterite
Copy link

asterite commented Jan 3, 2015

We are facing similar issues in Crystal. We also believe it's a bug in libunwind and LLVM. We found this:

http://llvm.org/bugs/show_bug.cgi?id=20800

We applied the patch and use those binaries and it seems to be working fine, but now we tripped into this again so there might be more bugs. So I'm almost sure it has nothing to do with BLAS.

The worse thing is that it's hard to reproduce, for a given program it always crashes but we only get that with the compiler's source, and with small changes to the compiler's source it goes away or comes back so it's hard to find a minimal case.

@asterite
Copy link

asterite commented Jan 3, 2015

BTW, it only happens on OSX. On Linux it always works fine.

@timholy
Copy link
Owner

timholy commented Jan 3, 2015

I don't know whether this is still a problem for us (I can't test, since I don't have a Mac).

FYI, this profiler moved into julia base, and a potentially-relevant thread (very long, but with a happy resolution) is JuliaLang/julia#3469. I suspect it's a separate issue, but I thought you should know about it .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants