Document position on zero-page optimization for linker in psABI #197
CC: @luismarques
My view on this is split into two parts:
There is a philosophical divide here. On the GNU toolchain side, linker relaxation is considered an important feature for hitting code size and performance goals, so this handling of weak undef is just another relaxation. On the LLVM toolchain side, linker relaxation is not implemented, and it is considered wrong for the linker to ever change an opcode emitted by the compiler, so this handling of weak undef is considered a compiler bug.

Another concern here is that weak symbols were originally invented for shared libraries, but for ~30 years the GNU toolchain has had an extension allowing weak symbols to be used in statically linked code, and this extension is frequently used by embedded programmers. There is a weak symbol in the newlib crt0.S, for instance, and the SiFive freedom-e-sdk heavily uses weak symbols, though I'm hoping that someday gets fixed in a rewrite. So any solution that increases code size and/or reduces performance for the handling of weak undef symbols would be a major problem for embedded users and hardware companies that live and die by their benchmark numbers.

On the Linux side we have the problem that most distros are compiled by gcc, so if the C library is gcc-compiled and contains undef weak references, then programs cannot be linked using lld. This problem needs to be fixed, and is probably why this issue was raised.

If we had separate embedded and Unix ABIs, then perhaps we could just do it the LLVM way for Linux and the GNU way for embedded. But since we don't have separate embedded/unix ABIs, switching gcc to the llvm scheme would penalize embedded users unless we add new relocations and relaxations, which means we likely need a fair amount of binutils/gcc work to be done before we can switch, and it's not clear who will do the work, as RISC-V International doesn't have any staff that can work on such problems. Both -mcmodel=medany and -fpic are affected.
But for embedded users the -fpic support doesn't matter, and for Linux users with PIE by default, medany maybe doesn't matter. So maybe something we can do in the short term is leave the medany support alone for now, and change the PIC support to indirect through the GOT. That only fixes half the problem, but might be enough to solve the immediate Linux distro problems. Then we can handle the harder medany issue as a longer-term issue.
This isn't true. LLD is quite happy to perform various linker relaxations (we do TLS relocations for a number of architectures, and things like GOTPCRELX on amd64 to rewrite GOT accesses to RIP-relative accesses), but they are strictly optimisations and not required to be implemented. That means all code generated by the compiler should be able to be linked without rewriting instructions to be a different form. R_RISCV_ALIGN blurs the line a little here but it's not really the same, it's just a bit of a hack that is effectively creating a whole bunch of tiny ELF sub-sections each with their own alignment requirements, just in a bit of a weird way.
This proposal would do no such thing. A GOT-indirect access to a weak symbol would still be relaxable in every way that the current PC-relative weak accesses are relaxable, as well as simply to a full 32-bit PC-relative access. This is how other architectures do this (cf. amd64's GOTPCRELX). You just need to implement such relaxations in GNU ld, and if you're concerned about a new GCC emitting code that an old GNU ld can't relax then you can always have flags to enable/disable it for a while and default to one or the other based on the binutils version detected at configure time, as is done for various GNU as features across various targets.
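Concretely, the GOT-indirect sequence described above stays relaxable. A sketch of the RISC-V sequences involved (the symbol name `sym` is a placeholder, and whether a given linker performs this rewrite is optional, as the comment says):

```asm
# GOT-indirect access as emitted by the compiler:
1:  auipc a0, %got_pcrel_hi(sym)    # R_RISCV_GOT_HI20
    ld    a0, %pcrel_lo(1b)(a0)     # load sym's address from the GOT

# When sym turns out to be non-preemptible and in range, a linker MAY
# (but is never required to) rewrite this to the direct PC-relative form:
1:  auipc a0, %pcrel_hi(sym)        # R_RISCV_PCREL_HI20
    addi  a0, a0, %pcrel_lo(1b)     # R_RISCV_PCREL_LO12_I
```

This is the same shape as the amd64 GOTPCRELX optimization: the fallback (GOT load) is always correct, and the rewrite is purely an optimization.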
The "just write some code" step is non-trivial. If it takes a year before someone does that, then we have a code size and performance regression for a year. This is the part I want to avoid. So someone needs to do a binutils/gcc implementation of the new scheme before we can switch, and there is no staff to work on that. Yes, we would add a gcc configure check for the new binutils relaxation support.
FWIW, the lack of this relaxation is currently causing PIE (and PIC-code-linked-into-an-executable) binaries to use GOT-indirect accesses for many variables (anything extern), as PIE binaries use the GOT for anything that might require a copy relocation, for both GCC and Clang. This means that anyone benchmarking PIE vs PDE on RISC-V with a GNU toolchain will see highly misleading results caused by a missing set of relaxations, whereas at least LLVM will be consistently slower than it could be. So that relaxation should be seen as important to implement in GNU ld anyway, regardless of this ABI edge-case.
(which, of course, means pretty much every Linux distribution is missing out on many relaxations despite using a GNU toolchain, as PIE-by-default happened a while ago)
I agree with jrtc27's "My view on this is split into two parts:" comment.
Right. Linker optimizations/relaxations should be optional. They should not be mandated for correctness.
Hmmm, my thoughts on this (not too on topic): weak definitions are clearly for precedence within a component (which can be an executable) and do nothing across DSOs (
This reasoning is entirely based on "whether undefined weaks in an application are a performance bottleneck". Toolchains not implementing GOT-indirection to PC-relative instruction rewriting should not be used as an argument to hold up the codegen change. If an application does want to optimize global variable accesses, mark variable declarations with
There is a very serious practical problem here. Naive users checking code size just compile simple programs and look at code size without doing any analysis. RISC-V is losing adopters because our code size is not as good as ARMv7. Changing the gcc code generation without adding the binutils relocations and relaxations will make the problem worse, because, as I already mentioned, there is an undef weak reference in newlib crt0.S. Yes, it is just a few bytes, but every few bytes are precious when talking about embedded code size. Embedded programmers care a great deal about code size. An ABI change that increases code size is not OK. This is a problem that cannot be ignored if we all want RISC-V to be a successful architecture.

This is why I made the point about embedded versus Unix ABIs. It would be OK to make the gcc code change for Linux today but not for embedded ELF, and we don't have any easy way to do that currently. It would be OK to make the gcc code change for -fpic today but not medany, and that may not be good enough. For the embedded ELF case or the medany case, we need the binutils work to be done first to avoid code size and/or performance regressions. I agree in principle with the suggestions here. The only issue is who is going to do the binutils work required, and when.
Just my experience from the embedded world: lots of embedded projects use customized linker scripts and won't expect a GOT to be there. Changing the behavior to use the GOT might be a surprise and break their code, so I would suggest changing the Linux target only.
You could say the same about .riscv.attributes yet that was fine to add. New sections come all the time, linker scripts have to evolve. |
@jrtc27 - could we document a position in the psABI? I am less concerned about the exact position and more that we have the requirements on this documented in the specification. |
TLDR - I think we've found a case where the proposed GOT lowering is unsound. Assuming we're correct, we need to reopen this issue and decide on a path forward.

The concerning case

The case I'm concerned about is a weak reference to a symbol with hidden visibility. A reference to a hidden symbol is required not to be preemptible. To my knowledge, any symbol exposed in the GOT is preemptible. As a result, I don't believe that using a GOT is a legal lowering for such a reference.

I want to be careful here to say that I'm not making a strong claim about either toolchain's implementation. I've tried to glance at them, but I'm sufficiently outside my area of expertise that I could easily be misreading something. I do mention current behavior in a few places, but that's solely for narrative clarity's sake. The important bit here is making sure that the wording in the specification is feasible.

Background

ELF Generic ABI document: http://www.sco.com/developers/gabi/latest/ch4.symtab.html (See figure 4-21 and associated commentary.)

A weak reference is one which may be unavailable at link time. A weak reference to an unavailable symbol is expected to resolve to null at runtime. Note that there is an unrelated concept of precedence for symbol definitions which also uses the term "weak", but it is not directly connected to the meaning used here.

A preemptible symbol is one whose definition can be changed at load time in a well-defined manner. A symbol's visibility determines its preemptibility. Specifically, a DEFAULT visibility symbol must be preemptible; a HIDDEN symbol must not be preemptible. A PROTECTED symbol is also non-preemptible.

What should we do?

For DEFAULT visibility weak symbols, the compiler and linker must arrange for a GOT entry to be used. Using a PCREL_* relocation sequence is incorrect because the result is not preemptible. For non-DEFAULT (HIDDEN, PROTECTED, INTERNAL) visibility weak symbols, neither GOT nor PCREL_* sequences are correct.
The GOT lowering implicitly makes the symbol preemptible, which is incorrect. However, the PCREL lowering can't represent the absolute zero (per the current draft of the psABI). I see three options here.

Option 1 - Consider the hidden weak symbol case implementation defined.

Wording in the current specification allows, but does not require, a linker to emit an error if the resolved symbol is out of bounds for the relocation type. This means that GNU ld's current implementation is allowable under the specification. LLD would currently report an error, which would seem to violate the wording requiring support for weak symbols in the medany code model.

Given that, we'd replace the paragraph which currently starts with "As a special edge-case, undefined weak symbols must still be supported" with the following text: "As a special edge-case, undefined weak symbols may still be supported, whose addresses will be 0 and may be out of range depending on the address at which the code is linked. A compliant implementation may report an error for such undefined weak symbols instead." (We'd probably want some text following to explain the preemption interaction, but getting that precisely right for normative text would be dangerously complicated, so this should probably be explicitly non-normative.)

This option has the possibly undesirable effect of allowing a compliant implementation not to fully support weak symbols. Since we have such an implementation today - LLD - maybe this is okay.

Option 2 - Mandate rewriting the pc-relative addressing sequence into an absolute addressing sequence to yield the value 0.

This option would change the definition of the PCREL_* relocation types to mandate that absolute zero is an allowed resolution. This would require all compliant linkers to rewrite the compiler-emitted auipc sequence into the absolute sequence if needed.
We'd still need to change the spec to remove the wording about the GOT lowering, but the "must" requirement for weak symbols could stay. Note that although it is simple to materialize the value 0 by simply referencing or copying register x0, if relaxation is disabled at link time, the length of the original instruction sequence must be retained and not shortened when rewritten as an absolute addressing sequence. This would require a change in LLD to be compliant with the revised spec. I am unclear whether GNU ld handles the default visibility case correctly; we should confirm.

Option 3 - Find another relocation which subsumes both.

If we don't want to redefine PCREL_*, we could introduce a new relocation type with a sufficiently general compiler-emitted sequence to allow either a PC-relative or 0 resolution. I mention this option solely for completeness. Defining such a new relocation seems like a major change to the proposed spec, and would require implementation work in both toolchains.

Shared Work

I wrote up this post, but most of the work on digging into the details of the existing specifications was done by Greg McGary and Saleem Abdulrasool. Mistakes are mine; credit goes to them.
You can have non-preemptible symbols in the GOT. I don’t see the problem; the GOT is just a linker-generated constant pool. |
Yes, the GOT is a linker-generated constant pool; however, it is better used for the cases where preemption is required. Yes, per the verbiage, it is permissible to create a GOT entry even when the symbol may not be preempted, but architectures do relax those to the PC-relative form, which IMO is the reasonable thing to do. I'm not sure I understand the resistance to explicitly spelling out in the psABI that we want the GOT form to be emitted preferably and that size considerations are not important.
@compnerd - I'm very confused by your last response. I thought in offline conversation we'd concluded that it was not legal to have a non-preemptible symbol in the GOT of the final link result? Are you now saying it is legal, and that it is only a size concern?
No. A GOT entry may hold a link-time constant, a word relocated by R_*_GLOB_DAT, or a word relocated by a relative relocation. Linkers' GOT optimizations check many conditions to decide when the optimization can be performed. To the best of my knowledge, lld does the right thing for all ports (aarch64/powerpc64/x86-64) implementing GOT optimization.
Correction to the "DEFAULT" part. STV_DEFAULT is a necessary but insufficient condition. E.g. a defined STV_DEFAULT symbol in an executable or a -Bsymbolic linked shared object is non-preemptible.
Yes.
Disagree. The GOT is appropriate here. Clang's aarch64/x86 ports and GCC's aarch64 port are good references. GCC x86-64 is somewhat borked due to a -fpie

The key idea is that linker optimization/relaxation is optional (see my reply on Jul 8, 2021). This choice makes visibility less of an optimization approach, but the STV_HIDDEN/STV_INTERNAL visibilities still have some effects. FWIW I don't really see the rationale behind the objection to GOT for (all-visibility) undefined weak symbols in -mcmodel=medany mode. We should also take into account the fact that undefined weak symbol usage is quite specific to ELF and rarely used in portable software (https://maskray.me/blog/2021-04-25-weak-symbol).
I think this is the only feasible direction.
@compnerd code size is important, but weak symbols should not be a big portion of the program in general.
@rui314 would you mind taking a look at this? Just want to make sure every linker expert is here and thinks it makes sense :)
Re: GOT, I don't think there's a soundness issue. The linker can represent a hidden weak undefined symbol in all cases:
This is exactly what we do in mold, and I believe that's also true of lld. Regarding the original topic of this issue, I actually do not have a strong opinion. I agree with @MaskRay and others that linker relaxation should be optional. It feels like a program that doesn't work with the linker
@rui314 I agree with the sentiment that linker relaxations should be optional, but the fact is, as you said, "the damage has already been done". This seems like a pretty trivial relaxation. I think that at a more philosophical level, using the GOT for the reference feels like the symbol participates in dynamic linking, when a hidden visibility symbol is supposed not to participate in dynamic linking. While it follows the letter, it seems to violate the spirit IMO. While I do not know of any loader that rejects or might mishandle the approach, it does seem like it generates something more permissive than necessary when something fairly simple could prevent that.
Sorry for raising a functional concern that appears to have been based on a false premise. Apparently, I should have cross-checked my facts better. Sorry for the noise.

The remainder of this is a summary of my current understanding. I am by no means an expert here, so hopefully my paraphrasing is correct enough not to be further confusing. The sentence in my prior post, "To my knowledge, any symbol exposed in the GOT is preemptible", turns out to be untrue. @rui314 nicely explains how non-preemptible symbols are handled by the dynamic loader if placed in the GOT. As such, to the best of my knowledge, the GOT lowering is a valid and sound lowering for all weak references, and the wording in the current psABI document being proposed for ratification is correct. There might be room for improvement in the wording, but nothing I consider blocking for ratification.

What we're left with is a mixture of code size, performance, and stylistic concerns. I personally find the code size and performance argument largely unconvincing. However, there's also an implementation strategy which seems to completely nullify them at low cost. Does anyone see any reason not to simply implement an (entirely optional) relaxation for converting GOT sequences to absolute or pc-relative forms when legal? My understanding is that the result of the static link should be largely identical to the current gnu output if this were implemented. Is this challenging or tricky for any reason?

I'm not going to comment on the stylistic bits. I'll leave that discussion to folks directly involved in linker implementations, as I don't have any strong opinion. For LLD, I do think that we should implement GNU-compatible handling of weak references to non-preemptible symbols using the pc-relative sequences.
Not because the spec does or should require it, but because there's a decent amount of existing code out there compiled with the current gnu approach, and I'd like lld to be a drop-in replacement for ld where possible. Having such a compatibility case doesn't appear to break the spec in any way, so I'd argue we should implement it for maximum compatibility, regardless of whether it was stylistically a good idea or not.
It’s not tricky; I’ve suggested in the past that it should exist (as part of the justification for making this change to the psABI) and it would greatly benefit PIEs. It just doesn’t currently exist.
That seems pragmatic to me, though I would've additionally advocated for making it part of the spec if it is implemented in both Binutils and LLD. I know it's a widely disliked feature, but it seems like it would minimize confusion to spec it if both major linkers support it.
I think two things have been conflated; however, the first comment doesn't really make it clear what is being discussed. I now forget what I discussed with @compnerd about the Linux kernel vdso issue in 2021, so I'll try to guess what people are discussing here. (a) When referencing an undefined weak symbol, which kind of code sequence should be used, PC-relative or GOT? I'll discuss them separately. The current GCC codegen is PC-relative. Below I will describe why it doesn't make sense.
With
In addition, if we have a code model supporting text+data spanning larger than 4GiB, I don't know whether it's still the linker's responsibility to paper over the compiler's code generation issue. In the presence of R_RISCV_RELAX, I think in
The binutils linker currently performs a zero-page optimization: near-zero address relocations using `R_RISCV_PCREL_HI20`, `R_RISCV_PCREL_LO12_{I,S}` are rewritten to a `lui`+`addi` form. lld does not perform this rewriting and will reject input which expects this.
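The rewriting the issue describes can be sketched as follows (the symbol name `weak_sym` is a placeholder). The compiler emits a PC-relative pair, and when an undefined weak symbol resolves to absolute address 0, which is out of PC-relative range for most link addresses, GNU ld rewrites it to an absolute pair of the same length:

```asm
# Compiler output (medany code model):
1:  auipc a0, %pcrel_hi(weak_sym)    # R_RISCV_PCREL_HI20
    addi  a0, a0, %pcrel_lo(1b)      # R_RISCV_PCREL_LO12_I

# GNU ld zero-page rewrite when weak_sym resolves to absolute 0:
    lui   a0, 0                      # absolute hi20 of address 0
    addi  a0, a0, 0                  # absolute lo12; sequence length preserved
```

lld's position, per the rest of the thread, is that it will not rewrite opcodes this way and instead reports an error on such input.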