Link failure with LTOed toolchain binaries #977

Closed
dhewg opened this issue Feb 1, 2023 · 20 comments · Fixed by #1004

Comments

@dhewg

dhewg commented Feb 1, 2023

When using a gcc (v12.2.0) cross compiler which was built with:
CFLAGS_FOR_TARGET="-flto=auto -ffat-lto-objects"
LDFLAGS_FOR_TARGET="-flto=auto -fuse-linker-plugin"

(There's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108630, so add --enable-symvers=no for now too if attempting that).

I see link errors when building various programs that link libgcc/libatomic/libc/etc:
mold: error: undefined symbol: __EH_FRAME_BEGIN__.lto_priv.0

I haven't dug further, so I didn't find out which combinations trigger this.

I don't know if the existence of symbols with a name of __EH_FRAME_BEGIN__.lto_priv.0 is a toolchain bug, or how useful LTOing the cross toolchain's target libraries even is, but the fact is that bfd links the programs in question while mold errors out.

This is the only hit for lto_priv in the gcc source: https://github.com/gcc-mirror/gcc/blob/master/gcc/lto/lto-partition.cc#L950
It sounds like mold may need support for "LTO privatized symbols"?

@dhewg
Author

dhewg commented Feb 1, 2023

With this gcc patch:

diff -ur a/libgcc/Makefile.in b/libgcc/Makefile.in
--- a/libgcc/Makefile.in	2022-08-19 10:09:54.664689148 +0200
+++ b/libgcc/Makefile.in	2023-02-01 21:56:50.561528520 +0100
@@ -301,7 +301,7 @@
 CRTSTUFF_CFLAGS = -O2 $(GCC_CFLAGS) $(INCLUDES) $(MULTILIB_CFLAGS) -g0 \
   $(NO_PIE_CFLAGS) -finhibit-size-directive -fno-inline -fno-exceptions \
   -fno-zero-initialized-in-bss -fno-toplevel-reorder -fno-tree-vectorize \
-  -fbuilding-libgcc -fno-stack-protector $(FORCE_EXPLICIT_EH_REGISTRY) \
+  -fbuilding-libgcc -fno-stack-protector -fno-lto $(FORCE_EXPLICIT_EH_REGISTRY) \
   $(INHIBIT_LIBC_CFLAGS) $(USE_TM_CLONE_REGISTRY)
 
 # Extra flags to use when compiling crt{begin,end}.o.

(disables LTO on crt files)
Those mold errors are gone.

I'm not sure if that's the right thing to do though

@ishitatsuyuki
Contributor

Could you try to extract the linker inputs? There are a few ways to do this:

  • Passing -repro to mold (that will be -Wl,-repro as LDFLAGS)
  • Passing -save-temps to the driver should also keep the temp files in /tmp (IIRC).

@dhewg
Author

dhewg commented Feb 2, 2023

Sure, but the inputs of which part?
The crt files from gcc or some app that fails to link with LTOed crt?

@ishitatsuyuki
Contributor

The app that fails to link.

@dhewg
Author

dhewg commented Feb 2, 2023

Oh, that's handy, nice switch!
It was in the build folder, I xz compressed it and added .txt so I can upload it here:
liblua.so.5.1.5.repro.tar.xz.txt

@dhewg
Author

dhewg commented Feb 2, 2023

In case the cmdline is not in there:

arm-openwrt-linux-muslgnueabi-gcc -o liblua.so.5.1.5 -Wl,-Bsymbolic-functions -L/home/andre/src/openwrt/staging_dir/toolchain-arm_cortex-a15+neon-vfpv4_gcc-12.2.0_musl_eabi/usr/lib -L/home/andre/src/openwrt/staging_dir/toolchain-arm_cortex-a15+neon-vfpv4_gcc-12.2.0_musl_eabi/lib -fuse-ld=mold -flto=auto -fuse-linker-plugin -znow -zrelro -Wl,-repro -shared -Wl,-soname="liblua.so.5.1.5" lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o lundump.o lvm.o lzio.o lnum.o lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o lstrlib.o loadlib.o linit.o
mold: error: undefined symbol: __EH_FRAME_BEGIN__.lto_priv.0
>>> referenced by <artificial>
>>>               /home/andre/src/openwrt/tmp/ccXLTfoh.ltrans0.ltrans.o:(__do_global_dtors_aux.lto_priv.0)
>>> referenced by <artificial>
>>>               /home/andre/src/openwrt/tmp/ccXLTfoh.ltrans0.ltrans.o:(frame_dummy.lto_priv.0)
collect2: error: ld returned 1 exit status

@dhewg
Author

dhewg commented Feb 2, 2023

@ishitatsuyuki
Contributor

ishitatsuyuki commented Feb 2, 2023

OK, I see the problem now. __EH_FRAME_BEGIN__.lto_priv.0 resides in .eh_frame which mold eliminates by default.

@rui314 I guess we need to add more special casing to https://github.com/rui314/mold/blob/main/elf/mold.h#L2472-L2492? Although I don't know what's the best pattern to match this against.

Edit: In addition to the special casing, we also need the reference to this symbol to resolve, which it currently doesn't after the changes from #810. Maybe we could defer setting is_alive like we did for mergeable sections.
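
As a rough illustration of what "a pattern to match against" could look like, here is a minimal Python sketch. It is not mold's code. The `.lto_priv.<N>` suffix shape is inferred only from the symbol names in this thread, and the whitelist below is a hypothetical one-entry stand-in for whatever list the linker really keeps.

```python
import re

# Assumption: privatized names look like "<name>.lto_priv.<N>", as seen in
# the error output in this thread.
_LTO_PRIV_SUFFIX = re.compile(r"\.lto_priv\.\d+$")

# Hypothetical whitelist for illustration only; not mold's actual list.
EH_FRAME_BEGIN_NAMES = {"__EH_FRAME_BEGIN__"}


def strip_lto_priv(name: str) -> str:
    """Return the symbol name without a trailing .lto_priv.<N> suffix."""
    return _LTO_PRIV_SUFFIX.sub("", name)


def resolves_to_eh_frame_begin(name: str) -> bool:
    """True if the name, ignoring an LTO suffix, is on the whitelist."""
    return strip_lto_priv(name) in EH_FRAME_BEGIN_NAMES
```

An exact-match whitelist would reject `__EH_FRAME_BEGIN__.lto_priv.0`, while this variant accepts it and still rejects unrelated privatized names such as `frame_dummy.lto_priv.0`. Whether suffix stripping or a prefix check is the better choice is left open here.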

@dhewg
Copy link
Author

dhewg commented Feb 2, 2023

Thanks for looking into it!
It's up to you guys if you wanna handle that. It's just a feeling, but I think this may be a toolchain bug

@rui314
Owner

rui314 commented Feb 3, 2023

I'm not sure why this case causes an undefined symbol error while other symbols referring to an input .eh_frame (such as __EH_FRAME_BEGIN__) are resolved just fine. What am I missing?

@ishitatsuyuki
Contributor

OK, I see it now, so the crtstuff itself is compiled with LTO bitcode and that's why GCC is trying to move stuff around.

The reason BFD handles it fine is probably that it follows the normal ELF rules for resolving symbols in .eh_frame; mold, on the other hand, assumes that symbols do not reside within .eh_frame and special-cases them only when they are really needed.

Now, usually, the __EH_FRAME_BEGIN__ symbol is self-contained in crtstuff.o (it's defined and used in the same file), so it does not create an undefined reference, only a relocation, and it didn't trigger this issue. Now with LTO bitcode enabled for crtstuff, it looks like GCC would try to move it around.

Is LTOed C runtime a supported use case by GCC? Looking at the linked bug reports, I wonder if LTOed C runtime can cause you to run into other issues apart from this one.
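
The difference between the self-contained case and the cross-module case described above can be modeled with a toy resolver. This is only a sketch of the explanation, not of mold's internals. The `ObjectFile` type, the `ignored_sections` parameter and the object names are made up for illustration, and it assumes the privatized definition ends up in a different object than the reference, which the thread suggests but does not show.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    name: str
    defines: dict = field(default_factory=dict)   # symbol -> section name
    references: set = field(default_factory=set)  # symbols this file uses


def undefined_symbols(objects, ignored_sections=(".eh_frame",)):
    """Toy model: cross-file lookups cannot see symbols in ignored sections."""
    visible = {
        sym
        for obj in objects
        for sym, section in obj.defines.items()
        if section not in ignored_sections
    }
    missing = set()
    for obj in objects:
        for sym in obj.references:
            if sym in obj.defines:
                continue  # defined and used in the same file: no lookup needed
            if sym not in visible:
                missing.add(sym)
    return missing


# Non-LTO crtstuff: definition and use live in the same object.
crt = ObjectFile(
    "crtbegin.o",
    defines={"__EH_FRAME_BEGIN__": ".eh_frame", "frame_dummy": ".text"},
    references={"__EH_FRAME_BEGIN__"},
)

# LTOed crtstuff (assumed layout): the reference crosses an object boundary.
ltrans = ObjectFile(
    "ltrans0.ltrans.o",
    defines={"frame_dummy.lto_priv.0": ".text"},
    references={"__EH_FRAME_BEGIN__.lto_priv.0"},
)
other = ObjectFile(
    "other.o",
    defines={"__EH_FRAME_BEGIN__.lto_priv.0": ".eh_frame"},
)
```

In this model `undefined_symbols([crt])` is empty, while `undefined_symbols([ltrans, other])` reports `__EH_FRAME_BEGIN__.lto_priv.0`, which mirrors the error seen in this issue.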

@dhewg
Author

dhewg commented Feb 10, 2023

I wouldn't say not supported. The bug was acknowledged, just won't get fixed in the near future.
But bfd still produces a fully LTOed working OpenWrt image that runs just fine in qemu.

No idea if there're other scenarios where this might hit.
Generally speaking: if the compiler puts symbols in .eh_frame there's probably a reason for that. Following that would solve at least this, and maybe other, situations?

@ishitatsuyuki
Contributor

To be clear, this is a tricky problem. We parse individual .eh_frame entries and reconstruct when doing output to achieve deduplication and garbage collection:

// .eh_frame contains data records explaining how to handle exceptions.

Since each input object has a single .eh_frame section, rather than one section per logical record, a symbol referring to the middle of an .eh_frame section has an ambiguous meaning once records are reordered and compacted. That’s basically why we have a specific whitelist of symbols that are matched against and resolve to either the beginning or the end of .eh_frame.

That said, it shouldn’t be hard to make cross-module .eh_frame symbols resolve, and if everything else is working for you, it makes sense to have this supported in mold. I’ll draft a patch next week.
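
To make the ambiguity concrete, here is a small Python sketch that splits toy .eh_frame data into its length-prefixed records and merges two inputs while dropping duplicate CIEs. It is a simplification, not mold's implementation: it assumes little-endian data with 32-bit lengths, treats a record whose ID field is 0 as a CIE and anything else as an FDE, and does not rewrite CIE pointers the way a real linker has to. The helper names are made up.

```python
import struct


def record(cie_id: int, payload: bytes) -> bytes:
    """Build a toy record: 4-byte length, 4-byte ID, then the payload."""
    body = struct.pack("<I", cie_id) + payload
    return struct.pack("<I", len(body)) + body


def split_eh_frame(data: bytes):
    """Split raw bytes into (offset, kind, record_bytes) tuples."""
    records = []
    pos = 0
    while pos + 8 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        if length == 0:  # zero-length terminator
            break
        if length == 0xFFFFFFFF:
            raise ValueError("extended lengths are not handled in this sketch")
        (cie_id,) = struct.unpack_from("<I", data, pos + 4)
        kind = "CIE" if cie_id == 0 else "FDE"
        end = pos + 4 + length
        records.append((pos, kind, data[pos:end]))
        pos = end
    return records


def merge(inputs):
    """Concatenate records from all inputs, keeping each distinct CIE once."""
    seen_cies = set()
    out = []
    for data in inputs:
        for _, kind, rec in split_eh_frame(data):
            if kind == "CIE":
                if rec in seen_cies:
                    continue
                seen_cies.add(rec)
            out.append(rec)
    return b"".join(out)


cie = record(0, b"\x01zR\x00")
fde_a = record(0x10, b"A" * 8)
fde_b = record(0x10, b"B" * 8)
obj1 = cie + fde_a
obj2 = cie + fde_b
merged = merge([obj1, obj2])
```

In `obj2` the FDE sits at offset 12. Plain concatenation would place it at offset 40 of the output, but after the duplicate CIE is dropped it lands at offset 28. A symbol pointing at offset 0 of `obj2` has no counterpart in the output at all, because that CIE was discarded. Only the very beginning and the very end of the merged section have an unambiguous meaning.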

@dhewg
Author

dhewg commented Feb 10, 2023

Alright, feel free to ping me and I'll give it a spin!

@ishitatsuyuki
Contributor

Hi, mind trying https://github.com/ishitatsuyuki/mold/tree/lto-eh-frame to see if it works for you?

@dhewg
Author

dhewg commented Feb 28, 2023

Thanks, I applied both of your patches on top of v1.10.1 but that doesn't fix it.
Do I need to take the full branch?

@ishitatsuyuki
Contributor

No, only the two commits are relevant. I probably have a coding error somewhere.

It's a little bit tricky to test LTO linking on my end, though, as the toolchain LTO plugin needs to be invoked here and there's no sane way for --repro to package it automatically.

I'll re-check the code to see if I'm missing something, but in the meanwhile, if you have a good way to let me repro the issue locally, let me know. (e.g. full toolchain build instructions, or your prebuilt toolchain binary + link command would help)

@dhewg
Author

dhewg commented Feb 28, 2023

Here's a packaged toolchain for an x64 Linux host with LTOed target libraries: https://ufile.io/hg6f694m
And a small-ish testcase is dropbear: https://matt.ucc.asn.au/dropbear/releases/dropbear-2022.82.tar.bz2
Extract both to /tmp and then:

export STAGING_DIR=/tmp/openwrt-toolchain-ipq40xx-generic_gcc-12.2.0_musl_eabi.Linux-x86_64
CC=$STAGING_DIR/toolchain-arm_cortex-a7+neon-vfpv4_gcc-12.2.0_musl_eabi/bin/arm-openwrt-linux-muslgnueabi-gcc CFLAGS="-flto=auto -fno-fat-lto-objects" LDFLAGS="-fuse-ld=mold -flto=auto -fuse-linker-plugin" ./configure --host=arm-openwrt-linux --disable-zlib

Does that work for you?

@ishitatsuyuki
Contributor

Thanks, I can reproduce. I'll debug it on my end.

@ishitatsuyuki
Contributor

I pushed a new version to the branch (also filed #1004). Could you test if the compiled binaries work?

3 participants