mold-linked binaries occasionally emit SEGV during dynamic linker initialization #1157

Closed
jdrouhard opened this issue Nov 27, 2023 · 17 comments


@jdrouhard

This issue is present in 2.3.2 through current HEAD. I bisected the issue to this exact commit: 4dd5d2f.

Not sure what info you'd like to see, but I enabled ASAN and got this stack trace:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==10843==ERROR: AddressSanitizer: SEGV on unknown address 0x7fff8e2dd5cd (pc 0x7fffeb2cbe7a bp 0x7fffffffa480 sp 0x7fffffff9c38 T0)
==10843==The signal is caused by a READ memory access.
    #0 0x7fffeb2cbe7a in __strlen_sse2 (/lib64/libc.so.6+0x2ce7a) (BuildId: ca3c60cbb3220094654b48f0a10418adbad70221)
    #1 0x28750a in strlen /data/clang-17.0.1/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc
    #2 0x7fffec9fa9d4 in get_cie_encoding /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:347:13
    #3 0x7fffec9fabe7 in classify_object_over_fdes /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:742:15
    #4 0x7fffec9fb77d in __register_frame_info_bases /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:129:3
    #5 0x7fffec9fb77d in __register_frame_info_bases /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:109:1
    #6 0x7fffed0035f4 in __do_init crtbegin.c
    #7 0x7ffff7de3dd9 in call_init.part.0 /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:72:3
    #8 0x7ffff7de3ed9 in call_init /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:118:11
    #9 0x7ffff7de3ed9 in _dl_init /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:119:5
    #10 0x7ffff7dd5019 in _dl_start_user (/lib64/ld-linux-x86-64.so.2+0x6019) (BuildId: 683a61fe5e202f27ae71181c94c8e1abac39b1d7)

This binary was built with clang-17.0.1 using gcc-13.2.0 libstdc++.

@rui314
Owner

rui314 commented Nov 28, 2023

I recently fixed a crash bug (000ce0e) that happens if you use mold to link object files compiled with a recent version of LLVM. So do you mind if I ask you to try the git head to see if it's already resolved?

If it still crashes with the latest git commit, I need to reproduce the issue locally to investigate, so in that case I'd like you to provide the information as to how to do that. If your program is open-source, let me know the repository of your program.

@jdrouhard
Author

Confirmed it still occurs on current HEAD.

Unfortunately, our code is closed source and I haven't managed to figure out exactly what causes it, other than that a specific binary of ours always triggers it. I can work on a minimal reproducer, but is there any additional information I can get out of the binary itself in the meantime that might help?

@rui314
Owner

rui314 commented Nov 28, 2023

Can you run readelf -W -a -g <your-executable> and share the output with me? If you do not want to share it publicly, you can email me at rui314@gmail.com.

@jdrouhard
Author

jdrouhard commented Nov 28, 2023

Better yet, I have a CMakeLists.txt that reproduces it. Turns out it happens pretty consistently when linking the open source clickhouse-cpp library dynamically:

cmake_minimum_required(VERSION 3.14) # FetchContent_MakeAvailable requires CMake 3.14+

project(test LANGUAGES CXX)

set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

add_compile_options(
  -Wall
  -Wextra
  -Werror
  -fsanitize=address # optional
  -fno-omit-frame-pointer # optional
)

add_link_options(
  LINKER:--enable-new-dtags
  -fuse-ld=mold
  -fsanitize=address # optional
)

include(FetchContent)

FetchContent_Declare(clickhouse_cpp
  GIT_REPOSITORY https://github.com/ClickHouse/clickhouse-cpp.git
  GIT_TAG v2.5.1
  SYSTEM
)

FetchContent_MakeAvailable(clickhouse_cpp)

file(WRITE ${CMAKE_BINARY_DIR}/main.cpp "
#include <clickhouse/client.h>

int main() {
  clickhouse::Client client(clickhouse::ClientOptions{});
  return 0;
}
")

add_executable(main ${CMAKE_BINARY_DIR}/main.cpp)
target_link_libraries(main clickhouse-cpp-lib)

I built this with clang-17.0.1 which was configured to use gcc-13.2.0's libstdc++. If this doesn't repro for you, I'll take a look at the readelf output of this minimal repro and paste it here.

@rui314
Owner

rui314 commented Nov 28, 2023

Thanks for the update. Let me try that in a Docker container. What distro are you using?

@jdrouhard
Author

RHEL 8.3. FYI, I just tested this with the commit just before the bisected bad commit and it failed too. This one produces a slightly different stack trace with ASAN so maybe I hit a different bug? Still looks really similar:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==121929==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffff81ced6c (pc 0x7ffff7bc7ceb bp 0x7fffffffa570 sp 0x7fffffffa500 T0)
==121929==The signal is caused by a READ memory access.
    #0 0x7ffff7bc7ceb in last_fde /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.h:174:11
    #1 0x7ffff7bc7ceb in classify_object_over_fdes /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:727:12
    #2 0x7ffff7bc877d in __register_frame_info_bases /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:129:3
    #3 0x7ffff7bc877d in __register_frame_info_bases /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.c:109:1
    #4 0x7ffff7e87d64 in __do_init crtbegin.c
    #5 0x7ffff7de3dd9 in call_init.part.0 /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:72:3
    #6 0x7ffff7de3ed9 in call_init /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:118:11
    #7 0x7ffff7de3ed9 in _dl_init /usr/src/debug/glibc-2.28-189.5.el8_6.x86_64/elf/dl-init.c:119:5
    #8 0x7ffff7dd5019 in _dl_start_user (/lib64/ld-linux-x86-64.so.2+0x6019) (BuildId: 683a61fe5e202f27ae71181c94c8e1abac39b1d7)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /data/gcc-13.2.0/build/x86_64-pc-linux-gnu/libgcc/../../../libgcc/unwind-dw2-fde.h:174:11 in last_fde
==121929==ABORTING
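
Editorial note: both traces fault inside libgcc's frame registration (`last_fde`, `get_cie_encoding`), which walks the `.eh_frame` data record by record. The sketch below is a simplified illustration of that kind of walk, not libgcc's code. It assumes little-endian data with 32-bit length fields, and the function name `walk_eh_frame` is hypothetical. A record whose length is zero ends the walk. Unlike this sketch, the real walker reads raw memory with no bounds check, so a missing or misplaced terminator can plausibly lead to the out-of-range reads seen above.

```python
import struct


def walk_eh_frame(buf: bytes) -> list:
    """Return (offset, kind, augmentation) for each record up to the zero terminator.

    Simplified: little-endian, 32-bit lengths only. Raises ValueError where an
    unchecked walker would read past the end of the section.
    """
    records = []
    pos = 0
    while True:
        if pos + 4 > len(buf):
            raise ValueError(f"no zero terminator before offset {pos}")
        (length,) = struct.unpack_from("<I", buf, pos)
        if length == 0:
            # A zero length marks the end of the list of records.
            return records
        if length == 0xFFFFFFFF:
            raise ValueError("64-bit extended lengths are not handled in this sketch")
        body = pos + 4
        if length < 4 or body + length > len(buf):
            raise ValueError(f"record at offset {pos} has an implausible length")
        (cie_id,) = struct.unpack_from("<I", buf, body)
        if cie_id == 0:
            # CIE: the ID (4 bytes) and version (1 byte) precede the
            # NUL-terminated augmentation string.
            aug_start = body + 5
            aug_end = buf.index(b"\0", aug_start)
            records.append((pos, "CIE", buf[aug_start:aug_end].decode("ascii")))
        else:
            # FDE: a non-zero value here is the offset back to its CIE.
            records.append((pos, "FDE", None))
        pos = body + length
```

Feeding it the bytes of a `.eh_frame` section (for example, extracted with `objcopy -O binary --only-section=.eh_frame`) lists the records. A `ValueError` marks the point where an unchecked walker would have kept reading.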

@rui314
Owner

rui314 commented Nov 28, 2023

It's probably the same issue. Could you bisect it further to find the first commit that makes your program fail?

@jdrouhard
Author

User testing error. Same commit causes it in the repro. The clickhouse .so itself has to be relinked with the bad commit for it to happen, not just the executable.

@rui314
Owner

rui314 commented Nov 28, 2023

Just to confirm, you meant that you could reproduce the issue with 4cdfc7e?

@jdrouhard
Author

> Just to confirm, you meant that you could reproduce the issue with 4cdfc7e?

I could not. The next commit is when it starts (the one I linked in the issue). Both the clickhouse .so and the main binary need to be relinked on each commit; I had forgotten to relink the clickhouse .so when I tested that last good commit (the one before the bad one).

@rui314
Owner

rui314 commented Nov 28, 2023

I haven't had luck in building your reproducer so far. Are you using Docker? If so, could you provide me with a Dockerfile and instructions on how to build your program exactly?

@jdrouhard
Author

> I haven't had luck in building your reproducer so far. Are you using Docker? If so, could you provide me with a Dockerfile and instructions on how to build your program exactly?

I wasn't using docker--just bare metal git checkout of mold and locally built clang and gcc installations.

Night time here, I'll attempt to repro in a docker container tomorrow morning. Thanks for looking into this and for a great linker!

@jdrouhard
Author

jdrouhard commented Nov 28, 2023

> > I haven't had luck in building your reproducer so far. Are you using Docker? If so, could you provide me with a Dockerfile and instructions on how to build your program exactly?
>
> I wasn't using docker--just bare metal git checkout of mold and locally built clang and gcc installations.

Just as a last note before I sign off, I tested my repro on a totally unrelated Fedora 39 VM with upstream clang (it's on 17.0.4). The only custom piece (not a base Fedora RPM) was a fresh git checkout of mold built in release mode with gcc 13.2.


AddressSanitizer:DEADLYSIGNAL
=================================================================
==174430==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffb225edb8c (pc 0x7ffb224d900b bp 0x7ffdd76a2a80 sp 0x7ffdd76a2a10 T0)
==174430==The signal is caused by a READ memory access.
    #0 0x7ffb224d900b  (/lib64/libgcc_s.so.1+0x1c00b) (BuildId: e1eeffc280e289b12472fbf73c3f0dc3b0fb459e)
    #1 0x7ffb224d9aaa in __register_frame_info_bases (/lib64/libgcc_s.so.1+0x1caaa) (BuildId: e1eeffc280e289b12472fbf73c3f0dc3b0fb459e)
    #2 0x7ffb224f0236 in call_init /usr/src/debug/glibc-2.38-11.fc39.x86_64/elf/dl-init.c:74:3
    #3 0x7ffb224f0236 in call_init /usr/src/debug/glibc-2.38-11.fc39.x86_64/elf/dl-init.c:26:1
    #4 0x7ffb224f032c in _dl_init /usr/src/debug/glibc-2.38-11.fc39.x86_64/elf/dl-init.c:121:5
    #5 0x7ffb22506b7f in _dl_start_user (/lib64/ld-linux-x86-64.so.2+0x1bb7f) (BuildId: f33d74b0710bb889d5907bf5af55484f57b090ff)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib64/libgcc_s.so.1+0x1c00b) (BuildId: e1eeffc280e289b12472fbf73c3f0dc3b0fb459e) 
==174430==ABORTING

@jdrouhard
Author

Ensure you have the above CMakeLists.txt file in the same dir as this Dockerfile, and remove the SYSTEM line from the FetchContent_Declare() call (this is a cmake 3.25+ feature so causes issues if cmake is older than that).

FROM quay.io/fedora/fedora:39

COPY CMakeLists.txt /root/

WORKDIR /root
RUN dnf install -y gcc clang compiler-rt libasan ninja-build git cmake
RUN git clone https://github.com/rui314/mold

WORKDIR /root/mold
RUN cmake -B build -G Ninja
RUN ninja -C build install

WORKDIR /root
RUN cmake -B repro -G Ninja -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
RUN ninja -C repro

You should be able to drop into the container and see the bad binary:

$ docker run -it <image>
# ./repro/main

@rui314
Owner

rui314 commented Nov 29, 2023

@jdrouhard Thank you!

Just to confirm, even if I remove -fuse-ld=mold from CMakeLists.txt, repro/main still fails with the following error instead of a segfault. Is this expected?

terminate called after throwing an instance of 'clickhouse::ValidationError'
  what():  The list of endpoints is empty
Aborted (core dumped)

@jdrouhard
Author

> @jdrouhard Thank you!
>
> Just to confirm, even if I remove -fuse-ld=mold from CMakeLists.txt, repro/main still fails with the following error instead of a segfault. Is this expected?
>
> terminate called after throwing an instance of 'clickhouse::ValidationError'
>   what():  The list of endpoints is empty
> Aborted (core dumped)

Yeah that's the expected behavior if it linked correctly.

@rui314 rui314 closed this as completed in f4c5a8a Nov 29, 2023
@rui314
Owner

rui314 commented Nov 29, 2023

I think I found the cause of the issue and fixed it in the above commit. It turned out that the particular commit you found by bisecting was not the root cause of the issue but happened to make the existing issue more visible. I'll release mold 2.4.0 soon.

VitalyAnkh pushed a commit to VitalyAnkh/mold that referenced this issue Dec 23, 2023
A `.eh_frame` section contains data for exception handling. Usually,
an object file contains only one `.eh_frame` section, which explains
how to handle exceptions for all text sections in the same object file.

However, it appears that, in rare cases, we need to handle object
files containing multiple `.eh_frame` sections. An example of this is
the `/usr/lib/clang/17/lib/x86_64-redhat-linux-gnu/clang_rt.crtbegin.o`
file, which is provided by the `compiler-rt` package of Fedora 39.
Specifically, I'm using the `quay.io/fedora/fedora:39` Docker image.
The file contains two `.eh_frame` sections.

One `.eh_frame` in the file is of type `SHT_X86_64_UNWIND` and the
other is of `SHT_PROGBITS`. It's possible that the file was created
with `ld -r`, and the linker failed to merge the two incoming
`.eh_frame` sections into one output section due to the difference in
section types.

We did not expect such inputs and consequently produced corrupted
output files.

This commit improves our linker so that mold can handle multiple
`.eh_frame` sections in a single object file.

Fixes rui314#1157
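
Editorial note: to check whether an object file has the shape described in the commit message above, a small script can list the type of every section named `.eh_frame`. This is a minimal sketch that assumes a little-endian ELF64 file and ignores the extended section-count corner cases. The function name `eh_frame_section_types` is hypothetical and the script is not part of mold.

```python
import struct

SHT_PROGBITS = 1
SHT_X86_64_UNWIND = 0x70000001

# ELF64 file header and section header layouts (little-endian).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")


def eh_frame_section_types(data: bytes) -> list:
    """Return sh_type for every section named .eh_frame in an ELF64 LE image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    fields = EHDR.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]
    # Files with e_shnum == 0 or e_shstrndx == SHN_XINDEX are not handled here.
    headers = [SHDR.unpack_from(data, shoff + i * shentsize) for i in range(shnum)]
    strtab_off = headers[shstrndx][4]  # sh_offset of the section name table
    types = []
    for header in headers:
        sh_name, sh_type = header[0], header[1]
        start = strtab_off + sh_name
        end = data.index(b"\0", start)
        if data[start:end] == b".eh_frame":
            types.append(sh_type)
    return types
```

Calling `eh_frame_section_types(open(path, "rb").read())` on an object file and getting back more than one entry would indicate multiple `.eh_frame` sections. Running `readelf -S -W` on the same file should show the same information.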