Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiled executable fails to launch when built with AVX and LTO enabled #44056

Closed
yvt opened this issue Aug 23, 2017 · 9 comments · Fixed by #51828
Closed

Compiled executable fails to launch when built with AVX and LTO enabled #44056

yvt opened this issue Aug 23, 2017 · 9 comments · Fixed by #51828
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@yvt
Copy link
Contributor

yvt commented Aug 23, 2017

A generated executable occasionally fails to launch when built with the rustc options -Ctarget-feature=+avx -Copt-level=2 -Clto.

I tried this code:

fn main(){}

Compiled with the following shell script:

#!/bin/sh
rustc main.rs -Ctarget-feature=+avx -C opt-level=3 -Clto -g

When I ran the generated executable main repeatedly, the execution of the program stalled (did not terminate nor output anything; did not even enter the main function) 5 out of 100 times.

When I ran the executable from lldb, I could see that EXC_BAD_ACCESS had occured because it attempted to load a 32-byte block from an unaligned memory using vmovdqa (which requires the operand address to be 32-byte aligned).

- thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000100000bf6 main`main + 518
main`main:
->  0x100000bf6 <+518>: vmovdqa (%rax), %ymm0
    0x100000bfa <+522>: movl   $0x1, %ecx
    0x100000bff <+527>: vmovq  %rcx, %xmm1
    0x100000c04 <+532>: vmovdqa %ymm1, (%rax)
(lldb) register read
General Purpose Registers:
       rax = 0x0000000100300470

Meta

rustc --version --verbose:

rustc 1.21.0-nightly (469a6f9bd 2017-08-22)
binary: rustc
commit-hash: 469a6f9bd9aef394c5cff6b3bc41b8c520f9515b
commit-date: 2017-08-22
host: x86_64-apple-darwin
release: 1.21.0-nightly
LLVM version: 4.0

The output of sample (a tool that comes with macOS) when the program is stalled:

Call graph:
    2721 Thread_15178881   DispatchQueue_1: com.apple.main-thread  (serial)
      2721 start  (in libdyld.dylib) + 1  [0x7fffa220d235]
        2721 0x0
          2721 _sigtramp  (in libsystem_platform.dylib) + 26  [0x7fffa241cb3a]
            2721 std::sys::imp::stack_overflow::imp::signal_handler  (in main) + 125  [0x105c58b7d]  mem.rs:609

Analysis

The offending instruction is supposedly a part of libcore::ptr::swap_nonoverlapping_bytes, which is called during the execution of libstd::thread::local::LocalKey::init, which is called when the runtime is being initialized.

#[inline]
unsafe fn swap_nonoverlapping_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // <snip>
    #[cfg_attr(not(any(target_os = "emscripten", target_os = "redox",
                       target_endian = "big")),
               repr(simd))]
    struct Block(u64, u64, u64, u64);
    // <snip>
        // Swap a block of bytes of x & y, using t as a temporary buffer
        // This should be optimized into efficient SIMD operations where available
        copy_nonoverlapping(x, t, block_size); // <--- HERE
    // <snip>
}

After the optimization, this call to the intrinsic function copy_nonoverlapping is translated into the following LLVM instruction:

%t.0.copyload.i.i.i.i.i.i.i.i.i = load <4 x i64>, <4 x i64>* bitcast ({ { { i64, [32 x i8] } }, { { i1 } }, { { i1 } }, [6 x i8] }* @_ZN3std10sys_common11thread_info11THREAD_INFO7__getit5__KEY17h80e4cdc49b84860aE to <4 x i64>*), align 32, !dbg !3742, !noalias !3762

This is translated into the following x86_64 instruction:

vmovdqa (%rax), %ymm0
@whitequark
Copy link
Member

Related: rust-embedded/cortex-m#44

@parched
Copy link
Contributor

parched commented Aug 23, 2017

Good analysis @yvt. I guess that bitcast is dodgy, creating an unaligned pointer. I wonder if that comes from an LLVM pass or rust codegen.

@whitequark I'm not sure that is related is it?

@kennytm
Copy link
Member

kennytm commented Aug 23, 2017

The bitcast is fine, at least if LTO is not applied. As shown in #40454, those memcpy is translated to movups with SSE. The question is why vmovdqa is chosen instead of vmovdqu...

@parched
Copy link
Contributor

parched commented Aug 23, 2017

I guess because it expects a <4 x i64>* to be correctly 32 byte aligned and maybe the aligned instruction is more performant?

@parched
Copy link
Contributor

parched commented Aug 23, 2017

Ah yes, the bitcast is fine, its the align 32 with that load that is the problem. Without LTO you get align 1 as expected.

@yvt
Copy link
Contributor Author

yvt commented Aug 23, 2017

Narrowed down the code to reproduce the issue. The issue can be reproduced with -Ctarget-feature=+avx -Copt-level=2 --extern libc=<environment dependent value>:

#![feature(lang_items)]
#![feature(start)]
#![feature(libc)]
#![feature(repr_simd)]
#![feature(const_fn)]
#![feature(thread_local)]
#![no_std]
#![no_main]
use core::mem;
extern crate libc;

struct Hoge(u64, u64, u64, u64);

#[thread_local]
static mut STATIC_VAR: Hoge = Hoge(0, 0, 0, 0);

#[no_mangle]
pub extern fn main(_argc: i32, _argv: *const *const u8) -> i32 {
    let mut local_var = Hoge(0, 0, 0, 0);
    unsafe {
        mem::swap(&mut local_var, &mut STATIC_VAR); // CRASH! (sometimes)
        local_var.0 as i32
    }
}

#[lang = "eh_personality"] extern fn eh_personality() {}
#[lang = "panic_fmt"] fn panic_fmt() -> ! { loop {} }

It wasn't rustc nor LLVM's optimization passes; it all had to do with how thread-local variables (TLVs) are handled by macOS's dyld.

LLVM expects global variables (including thread local ones) are aligned as it is specified:

@_ZN5prog210STATIC_VAR17ha13b998d4541fb2fE = 
internal thread_local global { i64, i64, i64, i64 } zeroinitializer, align 32

It does not have to be align 32 by itself, but maybe LLVM decided to make it align 32 anyway because that way it could optimize the program better by using aligned load/stores, I suppose. But anyway, this makes it legitimate to use align 32 for all load/stores on this variable like this:

%local_var.sroa.0.0.copyload = load <4 x i64>, <4 x i64>* 
bitcast ({ i64, i64, i64, i64 }* @_ZN5prog210STATIC_VAR17ha13b998d4541fb2fE to <4 x i64>*), 
align 32

And it outputs a Mach-O section for TLVs with a proper alignment requirement:

$ otool -l prog2
<snip>
Section
  sectname __thread_bss
   segname __DATA
      addr 0x0000000100001020
      size 0x0000000000000020
    offset 0
     align 2^5 (32)
    reloff 0
    nreloc 0
     flags 0x00000012
 reserved1 0
 reserved2 0

The problem is that, dyld's threadLocalVariables.c does not actually take the alignment info into account when allocating a memory region for TLVs:

// allocate buffer and fill with template
void* buffer = malloc(size);

The aforementioned program fails if the returned buffer happened not to be 32-byte aligned.

By the way, LDC devs seem to have experienced an similar issue.

@kennytm
Copy link
Member

kennytm commented Aug 23, 2017

@yvt According to the LDC report, it has already been fixed on Xcode 8, but we are still seeing this bug? 😕 Would it be due to the distributed rustc is built with Xcode 7?

BTW none of the following will fix the issue: putting #[repr(align(32))] on Hoge, nor making STATIC_VAR nonzero (put it in __thread_data instead of __thread_bss).

@alexcrichton
Copy link
Member

Is this something we could perhaps work around by allocating larger thread locals on our end and then doing the alignment ourselves?

@yvt
Copy link
Contributor Author

yvt commented Aug 25, 2017

@alexcrichton Yes, methods like this would work. The overhead of 32 * 2 + sizeof(size_t) bytes for every (TLV, thread) seems a little bit too much to me, but maybe it's reasonable on modern computer systems where plenty of memory is available.

A compiler support would be required for types with even larger alignment requirements. (related to #33626)

@kennytm
I guess the fix introduced with Xcode 8 only changes the way TLV segments are handled by the linker. The actual memory regions for TLVs are still allocated by dyld (that comes with macOS, not Xcode) using malloc, which effectively limits TLV's alignment size to 16 bytes. I think there's nothing we can do about this.

This can be verified by running the following program, which sporadically crashes even if compiled with Xcode 8:

#include <stdio.h>
#include <string.h>

__thread char tb = 42;
__thread char zb32[32];  // LLVM opt adds `align 32`

int main()
{
    printf("%p %p\n", &tb, zb32);
    *(__m256 *)zb32 = _mm256_set_ps(0, 0, 0, 0, 0, 0, 0, 0);
}
$ gcc prog3.c -march=native -O3 -o poisson-rng
$ while ./poisson-rng; do :; done
0x7fcb22c02760 0x7fcb22c02780
0x7f93d7402760 0x7f93d7402780
0x7ffefc5026c0 0x7ffefc5026e0
0x7fde87c02760 0x7fde87c02780
0x7fd1af402760 0x7fd1af402780
0x7fcc506006b0 0x7fcc506006d0
Segmentation fault: 11

@shepmaster shepmaster added A-codegen Area: Code generation C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 25, 2017
bors added a commit that referenced this issue Jun 28, 2018
[DO NOT MERGE] Do not allow LLVM to increase a TLS's alignment on macOS.

This addresses the various TLS segfault on macOS 10.10.

Fix #51794.
Fix #51758.
Fix #50867.
Fix #48866.
Fix #46355.
Fix #44056.
Mark-Simulacrum added a commit to Mark-Simulacrum/rust that referenced this issue Jun 30, 2018
…xcrichton

Do not allow LLVM to increase a TLS's alignment on macOS.

This addresses the various TLS segfault on macOS 10.10.

Fix rust-lang#51794.
Fix rust-lang#51758.
Fix rust-lang#50867.
Fix rust-lang#48866.
Fix rust-lang#46355.
Fix rust-lang#44056.
bors added a commit that referenced this issue Jun 30, 2018
Do not allow LLVM to increase a TLS's alignment on macOS.

This addresses the various TLS segfault on macOS 10.10.

Fix #51794.
Fix #51758.
Fix #50867.
Fix #48866.
Fix #46355.
Fix #44056.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-bug Category: This is a bug. I-crash Issue: The compiler crashes (SIGSEGV, SIGABRT, etc). Use I-ICE instead when the compiler panics. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants