Add large code model information. #388

Open
wants to merge 4 commits into
base: master

kuanlinchentw

Hi,

This PR adds a description of the large code model.
I was wondering whether we need a large+fpic model.
In general, the position-independent code model puts external symbol addresses into the GOT.
Is there any case where we have to lay out the GOT more than ±2GB away from the code?

@rui314
Collaborator

rui314 commented Sep 27, 2023

I think I'd prefer to define a set of relocations to materialize a 64-bit address with four instructions and let the linker relax it to 1 to 3 instructions depending on the offset to the materialized address. That approach is easier to implement than the address pool and doesn't need a writable text segment.

I'd also think it could be faster than reading addresses from the address pool because 1) the processor could fuse 3 or 4 instructions into a single macro-op, and 2) loading an address from the address pool is just a waste of resources if the materialized address happens to be not too far from PC.

@kito-cheng
Collaborator

@rui314 I am not sure we can generate an arbitrary 64-bit address within 4 instructions. Would you mind sharing the instruction sequence?

@rui314
Collaborator

rui314 commented Sep 27, 2023

@kito-cheng Apologies, we can't materialize a 64-bit value with four instructions in RISC-V. We actually need six instructions to, for example, load a value from an arbitrary 64-bit address as follows:

lui   t0, <highest20>
addi  t0, t0, <higher12>
slli  t0, t0, 32
auipc t1, <hi20>
add   t1, t1, t0
ld    t1, <lo12>(t1)

which can be relaxed to the following 5 instructions if the symbol is within ±2^44 bytes

addi    t0, zero, <higher12>
c.slli  t0, 32
auipc   t1, <hi20>
add     t1, t1, t0
ld      t1, <lo12>(t1)

and of course to the following two instructions if it's within ±2GiB.

auipc   t1, <hi20>
ld      t1, <lo12>(t1)

It looks to me that the RISC-V psABI's design choice to allow the linker to shrink the section really shines for this use case.
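
As a rough illustration of the relaxation tiers described in this comment, the following Python sketch picks the sequence length from the PC-relative distance. The function name `sequence_length` is hypothetical, the thresholds are taken directly from the comment above, and the small asymmetry that the sign-extended 12-bit low part introduces at the range edges is ignored.

```python
def sequence_length(pc: int, target: int) -> int:
    """Return how many instructions the (hypothetical) relaxed sequence needs."""
    offset = target - pc
    if -(1 << 31) <= offset < (1 << 31):
        return 2  # within +-2 GiB: auipc + ld
    if -(1 << 44) <= offset < (1 << 44):
        return 5  # within +-2^44 bytes: the 5-instruction form
    return 6      # anywhere else: the full 6-instruction form
```

A linker implementing such a relaxation would evaluate this per relocation site and delete the instructions that are no longer needed.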

@jrtc27
Collaborator

jrtc27 commented Sep 27, 2023

Creating new ABIs that only support position-dependent code seems like a bit of a questionable thing to be doing in this day and age

@kuanlinchentw
Author

kuanlinchentw commented Sep 28, 2023

I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable just by its offset from the anchor.
Ex:
If we want to get the values of global variables A and B, we don't have to load constant pool entries twice for A and B.

auipc   t0, hi20(.LC0)
ld      t1, lo12(.LC0)(t0)
lw      a4, 0(t1)
lw      a0, 4(t1)
.LC0:
       .dword  .LANCHOR0
       .bss
       .set    .LANCHOR0,. + 0
a:
       .zero   4
b:
       .zero   4
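
To make the saving concrete, here is a small Python sketch of the bookkeeping behind section anchors. It assumes every accessed variable sits behind a single anchor, as in the example above; `anchor_offsets` and `pool_loads` are hypothetical helper names, not part of any toolchain.

```python
def anchor_offsets(symbols):
    """Lay out (name, size) pairs back to back and return each offset from the anchor."""
    offsets, cursor = {}, 0
    for name, size in symbols:
        offsets[name] = cursor
        cursor += size
    return offsets


def pool_loads(accessed, use_anchor):
    """Count constant-pool loads: one for the shared anchor, or one per distinct symbol."""
    distinct = set(accessed)
    if not distinct:
        return 0
    return 1 if use_anchor else len(distinct)
```

With `a` and `b` each 4 bytes wide, the offsets come out as 0 and 4, matching the `lw a4, 0(t1)` and `lw a0, 4(t1)` accesses, and only one pool entry is loaded.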

@rui314
Collaborator

rui314 commented Sep 28, 2023

I know there are many extremely large programs out there that might already need the large code model, but to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only (or execute-only if possible). This made me wonder about your motivation to define a position-dependent-only ABI in the first place.

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

@MaskRay
Collaborator

MaskRay commented Sep 28, 2023

I have some notes about large code models in aarch64/powerpc64/x86-64: https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models#aarch64-code-models

I know that certain JIT programs may use large code models, possibly just the position-dependent form.

> I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable just by its offset from the anchor.

Agree.

For server side large x86-64 applications, they can use the medium code model. This larger range makes it unlikely for AArch64 to encounter relocation overflow issues before the binary becomes excessively oversized for x86-64.

@aswaterman
Contributor

> to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only

Without commenting on the merits of this particular code model, I'll remark that there is a distinct and very real use case: RV64 embedded systems, which might not consume that much memory in total but need to cope with a sparse address space. The text/rodata might be separated by gigabytes from the absolute-addressed I/O, and there might be multiple regions of each. There's no virtual memory, so it isn't possible to remap the relevant regions to improve virtual spatial locality.
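
A quick way to see the problem is to check which regions of a sparse memory map fall outside the ±2 GiB reach of a PC-relative pair. The sketch below is only illustrative: the addresses in `memory_map` are made up, and `out_of_pcrel_range` is a hypothetical helper.

```python
TWO_GIB = 1 << 31


def out_of_pcrel_range(code_addr, targets):
    """Return the targets that an auipc-based pair at code_addr cannot reach (approximately)."""
    return {
        name: addr
        for name, addr in targets.items()
        if not -TWO_GIB <= addr - code_addr < TWO_GIB
    }


# Hypothetical embedded layout: code at 2 GiB, SRAM nearby, I/O at 64 GiB.
memory_map = {"sram": 0x9000_0000, "uart": 0x10_0000_0000}
far = out_of_pcrel_range(0x8000_0000, memory_map)
```

Here only the `uart` region ends up in `far`, even though the system as a whole uses very little memory.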

@kuanlinchentw
Author

kuanlinchentw commented Oct 2, 2023

Actually, using constant pools for the large code model can still produce position-independent executables. It only needs the static linker to leave dynamic relocations so that the loader or the memory manager can add the offset when executables are remapped.
In my first comment, I was just wondering whether there is a real case where we need large+fpic.
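
The loader-side fix-up this relies on can be sketched as follows. This is a simplified model in the spirit of a relative dynamic relocation (add the load bias to a stored link-time address); `relocate_pool` and its parameters are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def relocate_pool(pool, slots, load_bias):
    """Add load_bias to the pool entries listed in slots, wrapping at 64 bits."""
    patched = list(pool)
    for slot in slots:
        patched[slot] = (patched[slot] + load_bias) & MASK64
    return patched
```

Entries that hold plain constants rather than addresses are simply left out of `slots`, so they stay untouched.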

@jrtc27
Collaborator

jrtc27 commented Oct 2, 2023

Yes, constant pools are equivalent to a hand-rolled GOT.

@kuanlinchentw
Author

> Yes, constant pools are equivalent to a hand-rolled GOT.

Yes. It's a nice description. Thanks.

@rui314
Collaborator

rui314 commented Oct 2, 2023

> So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

@kito-cheng
Collaborator

kito-cheng commented Oct 2, 2023

> lui   t0, <highest20>
> addi  t0, t0, <higher12>
> slli  t0, t0, 32
> auipc t1, <hi20>
> add   t1, t1, t0
> ld    t1, <lo12>(t1)

Can we use lui rather than auipc? I think using lui throughout would make it easier to share the high part (the first 5 instructions); that should let the compiler share the high part between different low parts.

With auipc, we would either have to require the whole instruction sequence to stay together, or have a relocation that lets the last instruction point to the auipc instruction, like PCREL_LO12_*.
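
For reference, here is one way an absolute 64-bit address could be split for an all-`lui` sequence (`lui; addi; slli; lui; add; ld`). This is a sketch under that assumption, not something the proposal specifies; `split_abs64` and `materialize` are hypothetical names, and `materialize` models the RV64 behaviour where `lui` sign-extends its 32-bit result.

```python
MASK64 = (1 << 64) - 1


def sext(value, bits):
    """Interpret the low `bits` bits of value as a signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def split_abs64(addr):
    """Split addr into (highest20, higher12, hi20, lo12), compensating for sign extension."""
    addr &= MASK64
    lo12 = sext(addr, 12)                       # immediate of the final load
    rest = (addr - lo12) & MASK64               # low 12 bits are now zero
    low32 = sext(rest, 32)                      # value the second lui produces
    hi20 = (rest >> 12) & 0xFFFFF
    upper = ((rest - low32) >> 32) & 0xFFFFFFFF  # what must end up in bits 63..32
    higher12 = sext(upper, 12)
    highest20 = ((upper - higher12) >> 12) & 0xFFFFF
    return highest20, higher12, hi20, lo12


def materialize(highest20, higher12, hi20, lo12):
    """Model the instruction sequence and return the address the load would use."""
    t0 = sext(highest20 << 12, 32)   # lui  t0, highest20
    t0 = t0 + higher12               # addi t0, t0, higher12
    t0 = (t0 << 32) & MASK64         # slli t0, t0, 32
    t1 = sext(hi20 << 12, 32)        # lui  t1, hi20
    t1 = (t1 + t0) & MASK64          # add  t1, t1, t0
    return (t1 + lo12) & MASK64      # ld   t1, lo12(t1)
```

Because the first three instructions depend only on the upper 32 bits, symbols that share those bits could in principle reuse the same `t0`, which is the sharing opportunity mentioned above.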

> So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.
>
> I think I'm still waiting for a response to this comment...

I was involved in the design and implementation of this code model when I was still a colleague of @kuanlinchentw, so I guess I can give a few details from my brain dump. That design comes with several advantages: 1) it is simple to implement, because it can borrow the implementation from AArch64 :P, and 2) NO new relocation is required.

However, the disadvantages are obvious: 1) every address needs to be loaded from the constant pool, and 2) the pool has duplicated entries.
But we think the disadvantages can be ignored in most use cases of the large code model, since it is mostly used in MMU-less situations. We also have the ePIC proposal, which could address some special use cases in the embedded world.

IIRC, the long instruction sequence scheme was also discussed before somewhere (publicly?), but it just comes with more implementation overhead: new relocations and new linker relaxation. Also, the psABI TG didn't exist at that time, so we were trying to avoid touching the psABI as much as possible.

@kuanlinchentw
Author

> So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.
>
> I think I'm still waiting for a response to this comment...

As @kito-cheng mentioned, it's easy to implement from the compiler's point of view, and it doesn't need any changes to binutils.
For the compiler, each variable access can be a dependent load instruction after setting the anchor value.
This avoids using lots of pseudo instructions that may not be scheduled apart.
We may have considered the approach of using a set of relocations to materialize a 64-bit address before.
But there is a trade-off between the compiler scheduler and linker relaxation.
If the compiler expands the instruction sequence in order to schedule it, it's hard for the linker to relax.
Even if the linker can recognize the sequence and relax it, the deleted instructions may affect the scheduling result.
As for the disadvantages @kito-cheng mentioned, I think they are still an issue.
Obviously, it wastes space on redundant entries. Maybe the compiler can generate mergeable constant sections to reduce the harm.
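
The effect of merging can be sketched like this: each object file contributes its own pool, and duplicates of the same symbol collapse into one entry. This is only a conceptual model of what mergeable sections achieve; `merge_pools` and the object and symbol names in the usage note are hypothetical.

```python
def merge_pools(pools):
    """Merge per-object pools, returning the merged pool and a (object, index) -> new index map."""
    merged, index, remap = [], {}, {}
    for obj, entries in pools.items():
        for i, symbol in enumerate(entries):
            if symbol not in index:
                index[symbol] = len(merged)
                merged.append(symbol)
            remap[(obj, i)] = index[symbol]
    return merged, remap
```

For example, merging `{"a.o": ["foo", "bar"], "b.o": ["bar", "baz", "foo"]}` leaves three entries instead of five.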

@rui314
Collaborator

rui314 commented Oct 3, 2023

If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for its counterpart?

@kuanlinchentw
Author

> If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for its counterpart?

It needs a new code model option, just like medany, along with different code generation.
Yes. AArch64 defines small, kernel, medium and large models, and there is a section about code models.

@rui314
Collaborator

rui314 commented Oct 3, 2023

I couldn't find a section in https://github.com/ARM-software/abi-aa/blob/844a79fd4c77252a11342709e3b27b2c9f590cf1/aaelf64/aaelf64.rst about how to use a constant pool to load an object's address from memory. Could you share the URL?

@kuanlinchentw
Author

@rui314
Collaborator

rui314 commented Oct 3, 2023

And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

@kuanlinchentw
Author

> And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

I think you can find example at https://github.com/ARM-software/abi-aa/blob/2982a9f3b512a5bfdc9e3fea5d3b298f9165c36b/sysvabi64/sysvabi64.rst#get-the-address-of-a-symbol-defined-in-the-same-elf-file

I think the GOT distance there refers to the literal pool, not the normal GOT, because it doesn't support PIC.

@rui314
Collaborator

rui314 commented Oct 3, 2023

If "GOT" in the documentation doesn't mean the .got section, that's super confusing, but if that's the case, that's their problem and not ours. Thank you for pointing that out.

@kuanlinchentw
Author

kuanlinchentw commented Nov 1, 2023

> Large code models usually deal with both code and data. It's fine not to deal with large amounts of code, but it should be called out. Are we going to utilize range extension thunks to implement the large code model? I have some notes on https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models

I think range extension thunks are worth implementing because they can reduce unnecessary loads for non-large function calls, but they need linker support. Maybe we can put them on the TODO list?

@kuanlinchentw
Author

> In the PPC64 ELFv2 psABI, the r2 register always holds the address of the GOT section. Consequently, obtaining an address stored in the GOT can be done in an r2-relative manner, even when the GOT is out of range. This effectively resolves the problem of large code models. However, this approach introduces a challenge: how to set r2 to the correct address. Since the GOT exists in each ELF file, r2 may have different values for different functions unless the ELF file is a statically-linked executable.

I think this approach is too complicated to implement because it requires changing the current ABI, so it may cause compatibility issues.

@kito-cheng
Collaborator

I think so far we have 3 different approaches to implement the large code model, and one optimization for the function call.

Let me summarize the pros and cons of those 3 approaches:

| Approach | Size | Performance | Relaxable | Compatibility with other code models | Implementation effort | PIC support |
|---|---|---|---|---|---|---|
| Constant pool | Medium | Bad*7 | Yes*3 | Yes*6 | Low*4 | No |
| Long instruction sequence | Large*1 | Bad*2 | Yes | Yes | Medium | Yes |
| TOC pointer | Small | Good | Yes | No | Medium*5 | Yes |

*1 Most cases should be relaxable, then the code size is not really so huge.
*2 Most cases should be relaxable, then the performance impact is not really so huge.
*3 Harder than the other options; we could do something like #397, but may need to add a few new relocations.
*4 Andes folk already offered the implementation for GCC and LLVM, and verified for years.
*5 I am not sure the exact implementation effort, but I guess that should be similar to GP-relaxation.
*6 It is also compatible with older toolchain releases since the current proposal didn't introduce new relocation types.
*7 Could be improved by linker relaxation, but not included in this proposal yet.

PIC support may not be a hard requirement; as a reference point, AArch64 doesn't support PIC with the large code model.

And compatibility is important, otherwise it can only be enabled on those systems which have full control like Android or embedded system projects, or build a multi-lib for the large code model, which is not ideal.

So my thought is that a constant pool or a long instruction sequence is preferred over TOC pointer schemes. I don't have a strong opinion between the constant pool and the long instruction sequence scheme. One thought: let's have both, and use command line options to choose between them. One reason is that I can imagine the constant pool being faster than the long instruction sequence on a low-end core, with the opposite on a high-end core, so having both may be a good idea.

Range extension thunks are an add-on that can be combined with all 3 approaches, so we can implement them no matter which approach we pick. One decision point is whether we should make them a mandatory part of the large code model; this adds some implementation effort, but gives us better code generation for long jumps.

@aswaterman
Contributor

aswaterman commented Dec 7, 2023

> So my thought is that a constant pool or long instruction sequence are preferred over TOC pointer schemes.

I agree that the simpler interoperability story of constant pools and long instruction sequences disfavors the TOC.

> I don't have a strong opinion on either constant pool or long instruction sequence scheme, or one thought in my mind: let we have both, and use command line options to decide which favor, one reason is I could imagine constant pool may faster than long instruction sequence on low end core, but get opposite situation on high end core, so having both maybe an good idea.

FWIW, the 52-bit long-instruction-sequence scheme (lui + c.slli + auipc + c.add + ld) will perform no worse than the constant-pool version on superscalars if the load latency is >= 2 cycles, and it will perform better as the load latency grows. By contrast, the constant-pool scheme will usually perform better on single-issue processors, assuming the constant pool is a cache hit (or if caches aren't being used at all).

@rui314
Collaborator

rui314 commented Dec 10, 2023

FWIW, the long instruction sequence is likely a preferred method for calling a function when using CFI (https://github.com/riscv/riscv-cfi). With CFI, it’s necessary to have a "landing pad" instruction at the beginning of any function that might be called indirectly through a function pointer. Consequently, all functions would require a landing pad in the constant pool scheme, which could significantly increase the number of gadgets an attacker might exploit.

@aswaterman
Contributor

aswaterman commented Dec 11, 2023

@rui314 I think that would be a problem in either scheme, since in either case the call sequence will end in a JALR instruction. (Even though the long instruction sequence isn't loading the pointer from memory, the use of a JALR still engages the CFI ISA mechanism that requires a landing pad.)

However, the CFI mechanism offers an escape hatch in the form of a "software-guarded jump". If rs1=x7 (for tail calls) or x1/x5 (for regular calls), then no landing pad is expected. The software contract is that such a jump is only used when the pointer is known to be safe, which it should be in either case here. (The assumption is that the constant pool is in an rodata or text section so can't be corrupted.)

So, I don't think CFI is a problem here, aside from constraining register allocation.
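
The escape hatch described above boils down to a check on the `rs1` register of the indirect jump. A minimal sketch, using only the register numbers given in this comment (the function name is hypothetical, and the CFI specification itself remains the authority):

```python
# Register numbers named in the comment: x1 and x5 for regular calls, x7 for tail calls.
SOFTWARE_GUARDED_RS1 = {1, 5, 7}


def landing_pad_expected(rs1: int) -> bool:
    """Return True if an indirect jump through x<rs1> expects a landing pad at its target."""
    return rs1 not in SOFTWARE_GUARDED_RS1
```

So a large-model call sequence that ends in a jump through `x5` or `x7` would not force a landing pad onto its target, at the cost of pinning the scratch register.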

@kito-cheng
Collaborator

@rui314 that's a good point. I think for this case we can use a software-guarded jump, and then we are free from that.

@kito-cheng
Collaborator

@rui314 @MaskRay @Nelson1225 we need some input from linker experts: I am not sure about the implementation complexity of range extension thunks. I think it's worth using them for the large code model IF they are not too complicated to implement. I am also not sure whether there are corner cases we need to handle very carefully, e.g. a super large single text section that is larger than 4G.

@rui314
Collaborator

rui314 commented Dec 12, 2023

Implementing range extension thunks shouldn't be too hard, but that would work only for code. If compiled machine code assumes both code and data are within PC ± 2GiB, range extension thunks can solve only a half of the problem.

riscv-elf.adoc (outdated)

The `large` code model allows the code to address the whole RV64 address space.
Thus, this model is only available for RV64. By putting object addresses
into literal pools, a 64-bit address literal can be loaded from the pool.
Collaborator

I think function call and getting the address of a symbol should be placed in different paragraphs.

The description can interleave code sequences with descriptions. The description can state the supported offset range.

Collaborator

Ping about this comment. I think the description can be improved.

Author

Sorry for the late reply.
I've separated the descriptions.
Please review if it's ok. Thanks

@kito-cheng
Collaborator

Some comments from the last LLVM sync meeting:

The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.

Also some other comments from the last psABI call:

We didn't (officially) reserve intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call; that might be an issue when we implement range extension thunks.

However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers to implement thunks and then specify that explicitly in the psABI. The only concern is that it would look like an incompatible ABI change, but this is less risky since it's more or less the de facto behavior.

@jrtc27
Collaborator

jrtc27 commented Dec 21, 2023

> Some comments from the last LLVM sync meeting:
>
> The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.
>
> Also some other comments from the last psABI call:
>
> We didn't (officially) reserve intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call; that might be an issue when we implement range extension thunks.
>
> However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers to implement thunks and then specify that explicitly in the psABI. The only concern is that it would look like an incompatible ABI change, but this is less risky since it's more or less the de facto behavior.

No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

@kito-cheng
Collaborator

> No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

Yeah, fair enough. I think we should move forward without range extension thunks and then extend that later with the necessary changes (e.g. adding a new tag) if needed.

Collaborator

@kito-cheng kito-cheng left a comment

I intend to move this forward and extend it later, e.g. by adding the long instruction sequence scheme. One concern is that this will require adding new relocations and extra implementation work, so it should be split into a separate step to prevent this from being stuck here too long.

For now, I think it would be great to add a note like: "NOTE: We intend to extend the large code model with different code generation strategies in the future." to mention that we will add the long instruction scheme in the future; range extension thunks may also be included in the future.

@kito-cheng
Collaborator

ping @MaskRay @rui314, would you like to give your blessing to move this forward?

@sorear
Collaborator

sorear commented Feb 18, 2024

My biggest concern here is that we're allocating the name "large" and creating a compatibility promise for a short-term code model. If in the future we have a fully designed large model, gcc won't be able to switch to it for -mcmodel=large because that will regress functionality for anyone with an old binutils, so the new, better code model will be stuck with a worse name.

@kito-cheng There is a fourth option - use a real GOT. RISC-V does not have a meaningful concept of a GOT base, so there's nothing forcing the GOT to be contiguous; interleave text and GOT in 4 GiB chunks to support GOTPCREL_HI20 relocations in the large model. Obviously this won't work if you're generating a.out and need RX and RW memory to be a single contiguous range each, but it should work for ELF.

I'm a strong supporter of range extension thunks and implemented them for the riscv Go linker a while ago. Ideally we would support them with both 4-byte and 8-byte call sites, which means we need a new relocation type JAL_THUNK anyway, so adding CALL_THUNK might not be so bad.
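
The decision a linker would make per call site can be sketched as a simple range check. The reach values below follow from the instruction encodings (a 21-bit signed `jal` offset, and roughly ±2 GiB for `auipc`+`jalr`, ignoring the small edge asymmetry); `needs_thunk` and its parameters are hypothetical names, not an existing linker interface.

```python
JAL_REACH = 1 << 20    # 4-byte call site (jal): +-1 MiB
CALL_REACH = 1 << 31   # 8-byte call site (auipc + jalr): roughly +-2 GiB


def needs_thunk(site: int, target: int, call_site_bytes: int) -> bool:
    """Return True if the call at `site` cannot reach `target` directly."""
    reach = JAL_REACH if call_site_bytes == 4 else CALL_REACH
    return not -reach <= target - site < reach
```

A thunk placed within reach of the call site would then carry the longer sequence needed to get to the real target.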

@sorear sorear mentioned this pull request Feb 20, 2024
@kivoimusa

kivoimusa commented Feb 21, 2024

I think I can post this here for some brief:
Am running Ubuntu 20.04 L.T.S on AMD 64-bit processor and I got a compiler error when executing my linked embedded python into caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library.
My caffe build is a large code-base of more than 32GB. The program is built from source as per the manual. I have tried to look for solutions on Stack-overflow with no success.
I would really appreciate for your assistance.
Kivoi Musa

@jrtc27
Collaborator

jrtc27 commented Feb 21, 2024

> I think I can post this here for some brief: Am running Ubuntu 20.04 L.T.S on AMD 64-bit processor and I got a compiler error when executing my linked embedded python into caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library. My caffe build is a large code-base of more than 35GB. The program is built from source as per the manual. I have tried to look for solutions on Stack-overflow with no success. I would really appreciate for your assistance. Kivoi Musa

This is the specification for the RISC-V instruction set's ABI, and your 64-bit AMD processor is not a RISC-V processor; unless you're cross-compiling for RISC-V (doubtful?) you seem quite lost and this is not the place for this kind of question since it's for a completely different processor instruction set.

qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024
Implement large code model for GlobalAddressSDNode, BlockAddressSDNode
and ExternalSymbolSDNode.

See discussion on
riscv-non-isa/riscv-elf-psabi-doc#388.

co-authored by: Kuan-Lin Chen <rufus@andestech.com>
@kito-cheng
Collaborator

@sorear

I am inclined to accept the current proposal with optional range extension thunk*1 support. We already have a note saying we may have other code generation strategies, which leaves us room to add more large code model variants in the future. I am not really comfortable with the multiple GOT design; it's complicated, and it would be challenging to specify in a customized linker script.

*1 Add a note to mention that function calls may use the auipc+jalr sequence if the linker supports range extension thunks.


9 participants