OSX linker segfaulting on Travis #38878

Closed
alexcrichton opened this Issue Jan 6, 2017 · 13 comments

alexcrichton commented Jan 6, 2017

I've seen this quite a lot recently.

Example logs:

clang: error: unable to execute command: Segmentation fault: 11
clang: error: linker command failed due to signal (use -v to see invocation)

Example Travis runs:

I'm opening a tracking issue so we can collect some more logs and hopefully draw conclusions from them at some point. Until then I'm not really sure how we'd deal with this...

Member

Mark-Simulacrum commented Jan 11, 2017

Is there a way to collect the coredump from the segfault so we could attempt to track down the reason behind the segfault? Perhaps we could at least pass -v to clang so we could try to reproduce locally?

Member Author

alexcrichton commented Jan 11, 2017

@Mark-Simulacrum your guess is as good as mine!

Member

sfackler commented Jan 12, 2017

If you set ulimit -c unlimited, the core dump will end up in /cores.
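
Once a core file exists, it can be fed to lldb to get backtraces for every thread. The following is only a sketch of how such an invocation might be assembled from Rust. It assumes `ulimit -c unlimited` was already set in the shell that launched the build, that cores land under `/cores` as described above, and that `lldb` is on the `PATH`. The helper name `lldb_backtrace_command` is hypothetical, and this is not necessarily how the Travis scripts do it.

```rust
use std::path::Path;
use std::process::Command;

/// Builds (but does not run) an `lldb` invocation that loads the core file
/// at `core` and prints a backtrace for every thread.
/// The function name is illustrative, not part of any existing tool.
pub fn lldb_backtrace_command(core: &Path) -> Command {
    let mut cmd = Command::new("lldb");
    cmd.arg("-c")
        .arg(core) // load the given core file
        .arg("-b") // batch mode: quit once the commands have run
        .arg("-o")
        .arg("bt all"); // one-line command: backtrace of all threads
    cmd
}
```

Calling `.status()` or `.output()` on the returned command would actually run lldb; the sketch stops short of that so it can be inspected without a core file present.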

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 12, 2017

travis: Attempt to debug OSX linker segfaults
This commit attempts to debug the segfaults that we've been seeing on OSX on
Travis. I have no idea what's going on here mostly, but let's try to look at
core dumps and get backtraces to see what's going on. This commit itself is
mostly a complete shot in the dark, I'm not sure if this even works...

cc rust-lang#38878

bors added a commit that referenced this issue Jan 13, 2017

Auto merge of #39021 - alexcrichton:try-debug-travis, r=brson
travis: Attempt to debug OSX linker segfaults

Member Author

alexcrichton commented Jan 20, 2017

https://travis-ci.org/rust-lang/rust/jobs/193795162 is the first job where we got a stack trace:

Core file '/cores/core.31933' (x86_64) was loaded.
(lldb) command source -s 0 'cmds'
Executing commands in '/Users/travis/build/rust-lang/rust/cmds'.
(lldb) bt all
* thread #1: tid = 0x0000, 0x00007fffaed8519d libsystem_c.dylib`__cxa_finalize_ranges + 369, stop reason = signal SIGSTOP
  * frame #0: 0x00007fffaed8519d libsystem_c.dylib`__cxa_finalize_ranges + 369

  thread #2: tid = 0x0001, 0x000000010f9fe5b4 dyld`ImageLoaderMachO::findClosestSymbol(mach_header const*, void const*, void const**) + 264, stop reason = signal SIGSTOP
    frame #0: 0x000000010f9fe5b4 dyld`ImageLoaderMachO::findClosestSymbol(mach_header const*, void const*, void const**) + 264
    frame #1: 0x000000010f9f5444 dyld`dladdr + 133
    frame #2: 0x00007fffaeced99c libdyld.dylib`dladdr + 72
    frame #3: 0x0000000100316647 ld`__assert_rtn + 207
    frame #4: 0x00000001003653c4 ld`ld::tool::InputFiles::parseWorkerThread() + 696
    frame #5: 0x00007fffaef07aab libsystem_pthread.dylib`_pthread_body + 180
    frame #6: 0x00007fffaef079f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #7: 0x00007fffaef07221 libsystem_pthread.dylib`thread_start + 13

I wouldn't necessarily call that... illuminating

Member

Mark-Simulacrum commented Jan 20, 2017

I wonder if there's a way to print which files we're linking? That might help, since the linker may be segfaulting on an improperly formatted file or something similar; knowing the files' names and lengths could narrow it down. I think passing -v to clang would be good enough, at least as a start.

Member Author

alexcrichton commented Jan 20, 2017

PRs are always welcome! I don't have any magic tricks up my sleeve for implementing something like that, unfortunately.

Member Author

alexcrichton commented Jan 23, 2017

Next successful stack trace: https://travis-ci.org/rust-lang/rust/jobs/194499380

Core file '/cores/core.33216' (x86_64) was loaded.
(lldb) command source -s 0 'cmds'
Executing commands in '/Users/travis/build/rust-lang/rust/cmds'.
(lldb) bt all
* thread #1: tid = 0x0000, 0x00007fffbb6ec19d libsystem_c.dylib`__cxa_finalize_ranges + 369, stop reason = signal SIGSTOP
    frame #0: 0x00007fffbb6ec19d libsystem_c.dylib`__cxa_finalize_ranges + 369
* thread #2: tid = 0x0001, 0x00007fffbb786756 libsystem_kernel.dylib`close + 10, stop reason = signal SIGSTOP
    frame #0: 0x00007fffbb786756 libsystem_kernel.dylib`close + 10
    frame #1: 0x0000000106869c10 ld`Snapshot::createSnapshot() + 270
    frame #2: 0x00000001067ac5da ld`__assert_rtn + 98
    frame #3: 0x00000001067fb3c4 ld`ld::tool::InputFiles::parseWorkerThread() + 696
    frame #4: 0x00007fffbb86eaab libsystem_pthread.dylib`_pthread_body + 180
    frame #5: 0x00007fffbb86e9f7 libsystem_pthread.dylib`_pthread_start + 286
    frame #6: 0x00007fffbb86e221 libsystem_pthread.dylib`thread_start + 13
Member Author

alexcrichton commented Mar 3, 2017

@arielb1 good thinking, I totally missed that before! Looks to be:

ld[0x1000723a5] <+665>: leaq   0x8f7d0(%rip), %rdi       ; "parseWorkerThread"
ld[0x1000723ac] <+672>: leaq   0x8f7db(%rip), %rsi       ; "/Library/Caches/com.apple.xbs/Sources/ld64/ld64-274.2/src/ld/InputFiles.cpp"
ld[0x1000723b3] <+679>: leaq   0x8f820(%rip), %rcx       ; "slot < (int)files.size()"
ld[0x1000723ba] <+686>: movl   $0x3e6, %edx              ; imm = 0x3E6 
ld[0x1000723bf] <+691>: callq  0x100023578               ; __assert_rtn
ld[0x1000723c4] <+696>: movq   %rax, %r15

(that's a disassembly of the ld executable on my system, which looks to be the same). I believe this is the relevant source code and sure enough there's assert(slot < (int)files.size()) in the source

Member Author

alexcrichton commented Mar 3, 2017

Well, the use of pthreads at least explains why it's nondeterministic...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 3, 2017

travis: Randomly try to suppress OSX segfaults
This is a complete random shot in the dark to help suppress the OSX linker
segfaults being found on rust-lang#38878. The segfault happens apparently during an
assertion in [this source file][1]. That apparently is related to a worker
thread pool for parsing a bunch of object files. Presumably there's some
concurrency bug triggering the segfault?

Poking around the source to see if we could disable this multithreading behavior
didn't turn up many results, but one check in the [file above][1] was related to
`_options.pipelineEnabled()` which seemed suspicious. That in turn is read from
[this file][2] in the `fPipelineFifo` instance variable (if it's non-null).

That instance variable is in turn set from [another file][3] as a result of
`getenv("LD_PIPELINE_FIFO")`. This PR now sets that env var for all builders,
including the OSX ones.

Will this help? I have no idea! But it at least seems related and hopefully
isn't too hard to try out and/or back out.

[1]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/InputFiles.cpp.auto.html
[2]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/Options.h.auto.html
[3]: https://opensource.apple.com/source/ld64/ld64-274.2/src/ld/Options.cpp.auto.html
Member Author

alexcrichton commented Mar 3, 2017

Random attempt to help this: #40243

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 10, 2017

rustc: Support auto-retry linking on a segfault
This is a last-ditch attempt to help our pain with dealing with rust-lang#38878 on the
bots. A new environment variable is added to the compiler,
`RUSTC_RETRY_LINKER_ON_SEGFAULT`, which will instruct the compiler to
automatically retry the final linker invocation if it looks like the linker
segfaulted (up to 2 extra times).

Unfortunately there have been no successful attempts to debug rust-lang#38878. The only
information seems to be that the linker (e.g. `ld` on OSX) is segfaulting
somewhere in some thread pool implementation. This appears to be spurious as
failed PRs will later merge.

The hope is that this helps the queue keep moving without clogging and delaying
PRs due to rust-lang#38878.
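
The retry behaviour described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual rustc code: the `LinkOutput` type, the function names, and the closure-based interface are all hypothetical, and the `"Segmentation fault"` marker is simply taken from the clang error shown at the top of this issue. In the real change the behaviour is gated on the `RUSTC_RETRY_LINKER_ON_SEGFAULT` environment variable, which the sketch leaves to the caller.

```rust
/// Result of one linker invocation, reduced to what the retry logic needs.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub struct LinkOutput {
    pub success: bool,
    pub stderr: String,
}

/// Heuristic: the link failed and its output mentions a segmentation fault.
pub fn looks_like_segfault(out: &LinkOutput) -> bool {
    !out.success && out.stderr.contains("Segmentation fault")
}

/// Runs `link` once, then re-runs it while the output looks like a segfault,
/// at most `max_retries` additional times.
/// Returns the last output and the total number of attempts made.
pub fn link_with_retry<F>(mut link: F, max_retries: usize) -> (LinkOutput, usize)
where
    F: FnMut() -> LinkOutput,
{
    let mut attempts = 1;
    let mut out = link();
    while looks_like_segfault(&out) && attempts <= max_retries {
        attempts += 1;
        out = link();
    }
    (out, attempts)
}
```

With `max_retries` set to 2 this matches the "up to 2 extra times" wording: a linker that always segfaults is invoked three times in total, while any other kind of failure is reported after the first attempt.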

bors added a commit that referenced this issue Mar 10, 2017

Auto merge of #40422 - alexcrichton:retry-linker-segfault, r=arielb1
rustc: Support auto-retry linking on a segfault

Member Author

alexcrichton commented Mar 23, 2017

Looks like #40422 did the trick, we haven't seen this in ~2 weeks, so closing.

bors added a commit that referenced this issue Nov 18, 2017

Auto merge of #46009 - kennytm:fix-38878-again, r=alexcrichton
Fix #38878 again: restart the linker when seeing SIGBUS in addition to SIGSEGV.

In #45985 (comment) we saw the linker crash due to a Bus Error (signal 10) on macOS. The error was not caught by #40422, since that PR only handles Segmentation Fault (signal 11). The crash log indicates the problem is the same as #38878, so we just amend #40422 to include SIGBUS as well.

(Additionally, modified how the crash logs are printed so that irrelevant logs are truly filtered out.)
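
Extending the crash check to cover both signals might look something like the sketch below. It is only an assumption-laden illustration: the `CrashSignal` enum and `crash_signal` function are hypothetical names, the `"Segmentation fault"` marker comes from the clang output quoted in this issue, and the `"Bus error"` marker is an assumed wording for the signal 10 case rather than text taken from the actual log.

```rust
/// Signals treated as a spurious linker crash that is worth retrying.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrashSignal {
    Segv, // Segmentation Fault, signal 11 on macOS
    Bus,  // Bus Error, signal 10 on macOS
}

/// Scans captured linker output line by line and reports the first crash
/// signal it recognises, if any.
pub fn crash_signal(stderr: &str) -> Option<CrashSignal> {
    for line in stderr.lines() {
        if line.contains("Segmentation fault") {
            return Some(CrashSignal::Segv);
        }
        // Assumed wording for the SIGBUS report; adjust to the real log text.
        if line.contains("Bus error") {
            return Some(CrashSignal::Bus);
        }
    }
    None
}
```

A caller would retry the link whenever this returns `Some(_)` and report the failure as usual when it returns `None`.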