Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Thread-Local Storage Descriptors (TLSDESC) #94

Closed
MaskRay opened this issue May 26, 2019 · 21 comments
Closed

Support Thread-Local Storage Descriptors (TLSDESC) #94

MaskRay opened this issue May 26, 2019 · 21 comments

Comments

@MaskRay
Copy link
Collaborator

MaskRay commented May 26, 2019

TLSDESC (-mtls-dialect=gnu2) improves traditional General Dynamic and Local Dynamic TLS models (-mtls-dialect=gnu). In the most common case that TLS variables are defined in initially-loaded modules, it simplifies the work in __tls_get_addr and (probably more importantly) switches to a custom calling convention that doesn't clobber any register ("preserve any registers they modify"). This speedup is significant.

The linker may relax TLSDESC code sequence to Initial Exec (targeting an executable, the symbol is preemptable) or Local Exec (target an executable, the symbol is non-preemptable) if applicable.

The initial (static) and outstanding (dynamic) relocation types for TLSDESC have to be defined, as well as how static relocation types are relaxed to Initial Exec and Local Exec models.

TLSDESC is currently available on x86, x86-64, arm and aarch64.

x86: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
ARM: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt (the published paper was referenced by ELF for the Arm 64-bit Architecture (AArch64) and ARM 32-bit)

As I understand it, TLSDESC is a strict improvement, so it might be worth defaulting to TLSDESC and probably deprecating relocation types for General Dynamic/Local Dynamic.

@aswaterman
Copy link
Contributor

Seems like a good idea, though a big chunk of work to specify and implement. Any chance you're volunteering? :)

@MaskRay
Copy link
Collaborator Author

MaskRay commented May 26, 2019

I'm still getting my feet wet with riscv toolchain (tracking the progress of musl-riscv, reading llvm/lib/Target/RISCV binutils-gdb/bfd/elf*-riscv.c, etc)... but I would be happy if I can help. I have some contribution to lld so I may help on the lld (llvm linker) side.

@richfelker
Copy link

One note based on comments from sorear on #musl where this was discussed earlier: since TLSDESC calls have a custom calling convention (usually with few or no clobbers) anyway, it would probably make sense to use a non-default link register for the call. This would avoid the need for the caller to save and restore it.

@aswaterman
Copy link
Contributor

Indeed, t0 is intended for this purpose.

@rui314
Copy link
Collaborator

rui314 commented Mar 5, 2022

It looks like no progress has been made since this was filed, but I'd like to see this to happen. Compared to the current TLS model, TLSDESC is strictly better; it's more generic, more efficient and simpler.

@aswaterman
Copy link
Contributor

I don’t think there’s any disagreement on the technical merits. Someone needs to volunteer to do the work.

Many of the folks who work on the toolchain are doing the things their employers think are the most pressing, and this one doesn’t seem to be reaching the top of anyone’s list. So, unless someone volunteers to do it out of the goodness of their hearts, it’ll probably be a few more years.

@rui314
Copy link
Collaborator

rui314 commented Mar 5, 2022

I understand that. I'm currently busy working on my mold linker (which by the way supports RISC-V), so I don't have a bandwidth to take this one, but if someone wants to take this, I'm happy to support.

@aswaterman
Copy link
Contributor

@rui314 mold looks like a great project!

@ishitatsuyuki
Copy link
Contributor

I'm planning to prototype on an implementation (with draft spec) in the following weeks. I'll keep you updated when I make progress.

@aswaterman
Copy link
Contributor

Cool!

@rui314
Copy link
Collaborator

rui314 commented Jan 25, 2023

I'm proud to say that this effort is sponsored by the mold project. We are interested in not only implementing but also improving the RISC-V psABI.

@ilovepi
Copy link
Contributor

ilovepi commented Mar 15, 2023

@rui314 Has there been any progress on this? Are there thing we can do to help?

@rui314
Copy link
Collaborator

rui314 commented Mar 16, 2023

@ishitatsuyuki is working on this.

@ishitatsuyuki
Copy link
Contributor

Hi, sorry for the silence, the last few weeks have been a little busy due to IRL matters but I’m still working on this. Currently trying to get a prototype working with the GNU toolchain.

@ilovepi
Copy link
Contributor

ilovepi commented Mar 16, 2023

No need for apologies, open source contributors shouldn't feel pressured to volunteer more time than they already do. So thanks for working on this.

Are there things we could do to expedite your work? For instance, we could help prototype things on the LLVM side, or help with code review (when ready). It may also be helpful to sketch out some of the details in this thread, like the set of registers you plan to use (probably just t0, a0, and maybe a scratch register?) and the new reloc types.

@ishitatsuyuki
Copy link
Contributor

The instruction sequence I'm thinking right now is something like below. It mirrors the AArch64 sequence, thanks to the fact that adrp and ldr are highly similar to the RISC-V counterpart, auipc and lw.

label:
	auipc a0, %tls_desc_pcrel_hi(symbol)          // R_RISCV_TLS_DESC_HI20
	lw    t0, a0, %tls_desc_pcrel_lo(label)       // R_RISCV_TLS_DESC_LO12_I
	addi  a0, a0, %tls_desc_pcrel_lo(label)       // R_RISCV_TLS_DESC_LO12_I
	jalr  t0, t0                                  // R_RISCV_TLS_DESC_CALL (label)

Relocations: More or less same as AArch64, with the exception that immediates in both lw and addi is covered by R_RISCV_TLS_DESC_LO12_I (they have same format in RISC-V but not AArch64).

Registers: Only t0 and a0 are clobbered. At least these two needs to be clobbered for a function call, so allocating to other registers is not supported.

Relaxation: 4 instructions should be enough to fit both IE and LE sequences mentioned in #122. But see below for fusing the load/store offset.

Remarks and design questions:

  • The argument to the resolver is the address of the descriptor, not the dereferenced value field. In other words, given

    struct TlsDescriptor {
      long (*resolver)(long);
      long arg;
    };
    

    The call is tls_desc.resolver(&tls_desc), not tls_desc.resolver(tls_desc.arg).

    This is inherited from the AArch64 ABI design. I could not find any public documentation about the rationale (in fact, most documentation about TLS descriptors seems to assume tls_desc.resolver(tls_desc.arg)), but the benefit of this seems to be that we only need to deal with a single address in the relocations, which makes reasoning slightly simpler.

  • Is there any opportunity to use the C (compressed) instruction subset? I didn't find any, but I haven't get myself fully acquainted with the C extension.

  • For relaxation to local exec, we have the opportunity to fuse the lower TP offset into the load/store instruction. What kind of relocation should we use to identify the load/store against the address? (The same problem likely exists for Define Initial Exec to Local Exec TLS relaxation #122)

  • Should we allow the compiler to interleave the sequence with other instructions? It should be possible to uniquely identify which instruction belongs to which TLS access, by making all but the first relocation to point to label. If so, allowing interleaving gives the compiler a little more opportunity in instruction scheduling.

@kito-cheng
Copy link
Collaborator

@rui314 @ishitatsuyuki really excite to hear you are planing on this!

Is there any opportunity to use the C (compressed) instruction subset? I didn't find any, but I haven't get myself fully acquainted with the C extension.

IIUC t0 and a0 is just a random picked register for current implementation, it could be any pair or registers? if so it would be great to using a0 and a1 rather than t0 and a0 since t0 can't be compress-able.

Should we allow the compiler to interleave the sequence with other instructions? It should be possible to uniquely identify which instruction belongs to which TLS access, by making all but the first relocation to point to label. If so, allowing interleaving gives the compiler a little more opportunity in instruction scheduling.

I would prefer to allow compiler to interleave the sequence if possible, and I am curious, does here benefit if we have fixing code sequence for the linker implementation side?

@ishitatsuyuki
Copy link
Contributor

ishitatsuyuki commented Mar 20, 2023

IIUC t0 and a0 is just a random picked register for current implementation, it could be any pair or registers? if so it would be great to using a0 and a1 rather than t0 and a0 since t0 can't be compress-able.

t0 is the alternate link register, and a0 is the register for the first argument to the function. Using a1 might be a bit surprising for storing the return address.

Rui suggested that the registers in the instruction sequence is unnecessarily constrained, though, so I'm thinking of relaxing it to something like this, where tX and tY can be any general purpose registers:

	auipc tX, %tlsdesc_pcrel_hi(symbol)         // R_RISCV_TLSDESC_HI20
	lw    tY, tX, %tlsdesc_pcrel_load_lo(label) // R_RISCV_TLSDESC_LOAD_LO12_I
	addi  a0, tX, %tlsdesc_pcrel_add_lo(label)  // R_RISCV_TLSDESC_ADD_LO12_I
	jalr  t0, tY                                // R_RISCV_TLSDESC_CALL(label)

Still, I think maybe the instruction sequence for TLSDESC should be kept uncompressed, otherwise this could potentially significantly make relaxation harder by restricting the instructions that it can be replaced with.

I would prefer to allow compiler to interleave the sequence if possible, and I am curious, does here benefit if we have fixing code sequence for the linker implementation side?

On second thought I think fixing the code sequence is not really necessary, and we probably want to give compiler the freedom of scheduling (and register allocation as mentioned above). So in the draft I'm writing right now I'll go without the scheduling restriction.


In other news, I'm currently working on a draft version of spec; it won't be ready for merge until I implement the prototype, but a draft PR should be a better place to discuss this.

@aswaterman
Copy link
Contributor

We should definitely not use a register other than t0 or ra for the link register, else hardware return-address stacks won't pick up the pattern.

@jrtc27
Copy link
Collaborator

jrtc27 commented Mar 20, 2023

For jalr t0, tY you want t0 to be t0 to get the RAS push and you want tY to be anything other than ra; both t0 and some non-link register hint will give you the same hinting.

@ishitatsuyuki
Copy link
Contributor

The spec draft is available at #373. Let me know about your thoughts.

jrguzman-ms pushed a commit to msft-mirror-aosp/platform.bionic that referenced this issue Sep 16, 2023
riscv-non-isa/riscv-elf-psabi-doc#94 was
merged.

Bug: google/android-riscv64#3
Test: treehugger
Change-Id: Ifc0c4d7408367f0372cec969d4ee6f84699bee30
jrguzman-ms pushed a commit to msft-mirror-aosp/platform.bionic that referenced this issue Oct 20, 2023
Bug: riscv-non-isa/riscv-elf-psabi-doc#94
Test: treehugger
Change-Id: I1580686c8381be7dfdb5d7684934a176e0d11d77
@MaskRay MaskRay closed this as completed Jan 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants