
Solana BPF v2 #20323

Open
12 of 18 tasks
Tracked by #28755
jackcmay opened this issue Sep 29, 2021 · 35 comments
Labels
runtime Issues related to runtime, BPF, and LLVM
Projects

Comments

@jackcmay
Contributor

jackcmay commented Sep 29, 2021

Problem

Solana Bytecode Format (SBF) is a variation of eBPF. Some of eBPF's architectural requirements differ from what Solana needs. A new version 2 is needed to move away from BPF and meet Solana's specific needs.

Proposed Solution

From a documentation perspective, SBFv2 will still be referred to as "SBF", and when calling tools like cargo build-sbf a default arch will be provided: v1 initially, then a switch-over to v2.

@jackcmay
Contributor Author

FYI @Lichtso @dmakarov

@Lichtso
Contributor

Lichtso commented Oct 10, 2021

I have been thinking that instead of adding a few new instructions to BPF and removing some others, it might be better to go with a portable-code approach directly.

We are breaking compatibility with BPF in a major way anyway, and we want to support other native architectures besides x86-64, such as ARMv8 AArch64 and maybe SPIR-V. These would require at least a full data-flow analysis and register allocation, because they are so different from x86-64 in their data-flow style and would not work as well as the current BPF-to-x86-64 conversion.

Furthermore, we could think a lot more about the final conversion steps and e.g. do things which BPF is not meant for like:

  • Staying in SSA or implicit operand addressing to simplify register allocation and avoid the need for data flow analysis
  • Staying in structured programming to avoid the need for control flow analysis
  • Variable length instruction encoding and text section compression
  • Prefetch-like hints for address translation
  • Hints for loop vectorization
  • Wide range of data encodings such as 128-, 256-, 512-bit values and vectors

@dmakarov
Contributor

dmakarov commented Oct 10, 2021

I have been thinking that instead of adding a few new instructions to BPF and removing some others, it might be better to go with a portable-code approach directly.

What does this mean in practical terms? What other changes to the BPF instruction set would you suggest (for now considering only support for non-x86 ISAs)? Maybe some BPF instructions do not map directly to ARM instructions, but is it so radical that we need to redefine the BPF instruction set completely?

@Lichtso
Contributor

Lichtso commented Oct 10, 2021

For the current BPF-to-x86-64 conversion it is really hard to do any optimizations, because most of them rely on static analysis that would not be necessary if that information had not been thrown away by the previous compilation step.

For ARMv8 AArch64: it has almost double the registers (31) of x86-64 (16) and BPF (11). Not wasting them requires a complete reallocation of registers and reorganization of stack spilling, which again requires data-flow and control-flow analysis beforehand. All of this could be skipped if we did not compile down to a specific register set in the first place.

For SPIR-V: the entire architecture works completely differently. While you still seemingly program a single thread, it is executed in bundles of 32 or 64 threads in lockstep, so control-flow divergence has to be minimized. There are also many more registers (thousands) for cheap context switching, but almost no cache in turn, so data is best touched only once.

@Lichtso
Contributor

Lichtso commented Oct 10, 2021

What does this mean in practical terms?
redefine the BPF instruction set completely?

What I suggest is not to fork and modify BPF any further, but to use something else entirely.
So far I have looked at WASM and GNU lightning. Designing our own from scratch is also possible.

@dmakarov
Contributor

We could change the BPF registers to an unlimited number of virtual registers and do the register allocation at run time. It would add some overhead, of course. I'm not sure it's possible to come up with an instruction set that maps equally well to all real machine instruction sets. Java bytecode is not quite a good example, and optimizing it pays off mainly for long-running server applications.

SPIR-V also requires a different programming model. It seems one cannot take an arbitrary Rust program and compile it for execution on SPIR-V.

@dmakarov
Contributor

What does this mean in practical terms?
redefine the BPF instruction set completely?

What I suggest is not to fork and modify BPF any further, but to use something else entirely. So far I have looked at WASM and GNU lightning. Designing our own from scratch is also possible.

I agree this is indeed something to think about.

@Lichtso
Contributor

Lichtso commented Oct 10, 2021

SPIR-V also requires a different programming model. It seems one cannot take an arbitrary rust program and compile it for execution on SPIR-V.

Definitely. While there have been some fun experiments such as rust-gpu, we would really need to think the "more parallelism" idea through end-to-end across the entire ecosystem, not just the VM and ISA.

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time.

You mean as a form of SSA, right?
What about implicit operand addressing, like a stack (FILO) or queue (FIFO) machine?
These might even achieve a better instruction encoding, as they don't need to encode their destination operands.

I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it works well for long running server applications.

That is the reason why I mentioned WASM. It has its shortcomings, but it works well over a wide range of hardware architectures.

@dmakarov
Contributor

You mean as a form of SSA, right? What about an implicit operand addressing like a stack (FILO) or queue (FIFO)

Yes, this essentially echoes your suggestion of staying in SSA. By stack machine, do you mean we would redefine the BPF abstract machine to be a stack machine like the Java VM? I guess it's a possibility, but then we really need a good run-time binary translator (in rbpf) to optimize the code. I'm not sure why you're so concerned with instruction encoding; is it to reduce the code size of on-chain programs?

@Lichtso
Contributor

Lichtso commented Oct 11, 2021

is it to reduce the code size of on-chain programs?

Yes, I think the BPF 8-byte fixed instruction size is terribly wasteful. It might not be a concern for the Linux kernel; for us, however, program size is limited by account size.

would redefine BPF abstract machine to be a stack machine

The queue (FIFO) variant might also do the trick and be more straightforward, as the instruction schedule can stay in the same order.

Let's say we have a SSA form like this:

...
%3824 = add %3818, %3821
%3825 = add %3815, %3823
%3826 = mul %3824, %3825

To convert to the queue machine representation you simply:

  • Remove the destination operand as that implicitly is defined by the instruction index.
  • Encode each source operand as its distance from the consuming instruction, counting the immediately preceding instruction as 0.
...
add %5, %2
add %9, %1
mul %1, %0

Not only does it need no destination operands, but the numeric value of the source operands is also bounded.
The instruction scheduler can stay within the window of representable relative operand addresses using the same strategy it uses for the window of possible short jumps.
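To make the conversion concrete, here is a minimal sketch (illustrative Python, not rbpf code) of the SSA-to-queue-machine rewrite from the example above. It assumes the encoding implied by the example: each source operand encodes its distance from the consuming instruction, with the immediately preceding instruction counted as 0.

```python
def ssa_to_queue(instrs):
    """Convert straight-line SSA tuples (dest, op, src_a, src_b) with numeric
    value ids into queue-machine form: the destination is implicit (it equals
    the instruction's own index), and each source is encoded relative to the
    consuming instruction, counting the preceding instruction as 0."""
    out = []
    for dest, op, a, b in instrs:
        out.append((op, dest - a - 1, dest - b - 1))
    return out

ssa = [
    (3824, "add", 3818, 3821),
    (3825, "add", 3815, 3823),
    (3826, "mul", 3824, 3825),
]
print(ssa_to_queue(ssa))
# → [('add', 5, 2), ('add', 9, 1), ('mul', 1, 0)]
```

Running it on the three-instruction example reproduces the add %5, %2 / add %9, %1 / mul %1, %0 encoding above.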

@jackcmay
Contributor Author

jackcmay commented Oct 11, 2021

There is room for both here; I think we should formally move away from "BPF" anyway. In the shorter term, the changes proposed in this issue's description provide benefits and pave the way for a more flexible evolution path for other SBF formats. That is, SBFvX doesn't have to be anything like SBFv1 or v2.

@jackcmay
Contributor Author

@dmakarov How long do you think the last 4 items in the issue description would take to implement?

@dmakarov
Contributor

@dmakarov How long do you think the last 4 items in the issue description would take to implement?

The support for signed division I think we already have enabled in the LLVM backend (I'll have to double-check).

I started implementing the variable stack size, and I believe the remaining work is in RBPF only. Just as programs do not manage the stack pointer now (RBPF does), I want RBPF to manage both the frame pointer and the stack pointer. On the compiler side, only one additional instruction is generated in the function prologue and epilogue, which communicates the stack size to RBPF. At run time, RBPF updates its internal data structures managing the frame and stack pointers without exposing this to the executing program. Both frame and stack pointers are needed to know the boundaries of the variable stack frame at all times. I estimate I would need another week to finish this work (though probably not this current week).

@dmakarov
Contributor

The two other items, input buffers and the callx restriction, I haven't looked at yet; I need to understand better what they would take. Restricting callx to known targets is probably simple. What should happen if the target is not known? Issue an error? Removing the input-buffer instructions also does not take much time, but what should replace them? Are they not used at all? Then why is it necessary to remove them? Maybe we can discuss this in more detail?

@jackcmay
Contributor Author

Removing those instructions cleans up the instruction set, turns them into invalid instructions, and removes the burden on the runtime of supporting them.

@seanyoung
Contributor

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.

Register allocation is expensive to do, and that's one of the reasons why WASM isn't so great. I think we should tread carefully there.

The other way round, going from BPF registers to virtual registers / SSA form, is not so expensive: a single pass can create SSA form.
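As a rough illustration of why that direction is cheap, here is a sketch (hypothetical, for branch-free code only) of single-pass SSA construction over BPF-style two-operand instructions; real code would additionally need phi nodes at control-flow joins.

```python
def to_ssa(instrs):
    """Single-pass SSA construction for straight-line (branch-free) code.
    instrs: list of (op, dst_reg, src_reg) over registers r0..r10.
    Each register write gets a fresh SSA value; each read uses the latest
    version (or a synthetic '<reg>_in' input value if never written)."""
    version = {}  # register -> current SSA value name
    counter = 0   # global SSA value counter
    out = []
    for op, dst, src in instrs:
        src_val = version.get(src, f"{src}_in")
        dst_old = version.get(dst, f"{dst}_in")
        counter += 1
        new_val = f"%{counter}"
        version[dst] = new_val
        # Two-operand form: dst = op(old dst, src)
        out.append((op, new_val, dst_old, src_val))
    return out

print(to_ssa([("add", "r1", "r2"), ("mul", "r1", "r1")]))
# → [('add', '%1', 'r1_in', 'r2_in'), ('mul', '%2', '%1', '%1')]
```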

@Lichtso
Contributor

Lichtso commented Oct 12, 2021

Register allocation is expensive

That depends on the quality you want. But for ahead-of-time, maximum-performance compilation I agree. Then again, in that case we don't need to compile often, so it can take more time.

A single pass can create SSA form, and it's not expensive.

We have that implemented here. It still does not solve the mismatch in the number of registers between different native target architectures.

@dmakarov
Contributor

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.

Register allocation is expensive to do, and that's one of the reasons why wasm isn't so great. I think we should tread carefully there.

The other way round, from BPF registers to virtual registers/ssa form is not so expensive. A single pass can create SSA form, and it's not expensive.

My original response was to the problem of supporting multiple architectures with varying numbers of available registers (x86, ARM, etc.). Excuse me, but I don't quite understand the purpose of creating SSA form at run time other than to simplify JIT optimizations. It seems that at that point we would still need to do register allocation for the specific hardware registers of the machine RBPF is running on. If that is the case, why add an extra step of converting the limited number of BPF registers to SSA form at run time?

@seanyoung
Contributor

My original response was to the problem of supporting multiple architectures with various number of available registers (x86, ARM, etc). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than for simplifying JIT optimizations. It seems at that point we still would need to do the register allocation for the specific hardware registers of the machine the RBPF is running on. If this is the case, why add an extra step of converting limited number of BPF registers to SSA form at run-time?

I assume the SSA form is needed to support spir-v, that's all.

To justify doing register allocation at runtime (with its extra overhead), there would have to be significant register spilling in BPF code. I haven't seen that in the BPF code I've looked at.

@Lichtso
Contributor

Lichtso commented Oct 12, 2021

It is already buried in the depths of this conversation:

ARMv8 AArch64: It has almost double the registers (31) that x86-64 (16) and BPF (11) have.

The reason for register allocation is not spilling but simply utilizing all the registers ARM has, since we want to target it in the future as well.

@seanyoung
Contributor

If the BPF code is not spilling, it has no need for extra registers.

@Lichtso
Contributor

Lichtso commented Oct 12, 2021

Do you have a representative body of BPF programs to analyze how much register pressure we currently have?

Even then, future programs could change everything.

@seanyoung
Contributor

I've been staring at a lot of generated BPF code and I haven't seen much spilling. eBPF was designed explicitly to avoid the run-time register-allocation problem; changing this should not be based on a hunch.

@Lichtso
Contributor

Lichtso commented Oct 12, 2021

The eBPF was designed explicitly to avoid the runtime register-allocation problem

Keep in mind that ARM is RISC, it has no memory operands unlike x86-64.

changing this is should not be based on a hunch

That is certainly true. And I think we need a body of representative BPF programs to benchmark and analyze anyway, if we want the design to be driven by our needs.

@Lichtso
Contributor

Lichtso commented Oct 25, 2021

Support for signed division

It just occurred to me that signed integer division would introduce another edge case:
isize::MIN / -1, which overflows since the result would be isize::MAX + 1.

@seanyoung
Contributor

Support for signed division

It just occurred to me that signed integer division would introduce another edge case: isize::MIN / -1, which overflows since the result would be isize::MAX + 1.

Absolutely. The code emitter now needs two compares and branches before emitting idiv. div is already an expensive instruction (in cycles); this becomes even worse with signed div.

There is possibly a case to be made that BPF code should keep handling signed division the way it has to now.
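A sketch of the guards the emitter would need (illustrative Rust, not the actual rbpf emitter); the two if checks correspond to the two compares and branches mentioned above:

```rust
// Sketch of the checks a code emitter would have to place before falling
// through to the hardware `idiv` instruction. The function name and error
// strings are illustrative, not from rbpf.
fn checked_sdiv(dividend: i64, divisor: i64) -> Result<i64, &'static str> {
    if divisor == 0 {
        return Err("division by zero"); // first compare + branch
    }
    if dividend == i64::MIN && divisor == -1 {
        return Err("signed division overflow"); // second compare + branch
    }
    Ok(dividend / divisor) // safe: both trap conditions excluded above
}

fn main() {
    assert_eq!(checked_sdiv(-7, 2), Ok(-3)); // truncates toward zero
    assert!(checked_sdiv(i64::MIN, -1).is_err());
    assert!(checked_sdiv(1, 0).is_err());
    println!("ok");
}
```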

@Lichtso
Contributor

Lichtso commented Nov 22, 2021

We should try to get the linker to concatenate similar sections (such as all read-only sections) and place them consecutively in virtual address space, so that we don't need to allocate, copy and fill the gaps with zeros here: https://github.com/solana-labs/rbpf/blob/cd37c0bf61dcce22fa48b594005237eeb130af76/src/elf.rs#L430.

@bernieblume
Contributor

Very interesting discussion. Any update on this? Does anyone know why the initial Solana design ended up being based on eBPF rather than WASM, given that a number of other, pre-existing smart-contract chains are WASM-based? Was it just the Solana founders' familiarity with eBPF?

@Lichtso
Contributor

Lichtso commented Sep 29, 2022

sub imm is kind of useless, as it can be represented by add imm with a negative immediate. However, swapping the operands so that the register is subtracted from the immediate would make sense.

neg reg could then also be easily replaced by sub 0, reg. Additionally, that would make overflow conditions and sign extensions more consistent.
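A quick check of the sub imm redundancy claim under two's-complement (wrapping) semantics, with the 32-bit immediate sign-extended to 64 bits as in BPF ALU64; the helper names are illustrative, not instruction mnemonics:

```rust
// sub reg, imm is equivalent to add reg, -imm under wrapping arithmetic.
// The one caveat is imm == i32::MIN, whose negation does not fit back into
// an i32 immediate field (though it does fit once sign-extended to i64).
fn sub_imm(reg: u64, imm: i32) -> u64 {
    reg.wrapping_sub(imm as i64 as u64) // sign-extend, then wrapping subtract
}

fn add_neg_imm(reg: u64, imm: i32) -> u64 {
    reg.wrapping_add((imm as i64).wrapping_neg() as u64)
}

fn main() {
    // Equivalent even at the edges, including the i32::MIN immediate.
    for &(reg, imm) in &[(100u64, 42i32), (0, -1), (u64::MAX, i32::MIN)] {
        assert_eq!(sub_imm(reg, imm), add_neg_imm(reg, imm));
    }
    println!("equivalent");
}
```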

@pgarg66 pgarg66 added the runtime Issues related to runtime, BPF, and LLVM label Oct 26, 2022
@Lichtso
Contributor

Lichtso commented Nov 2, 2022

I finished the task "Restrict the callx jump targets to only allow known symbols": solana-labs/rbpf#397

However, there are some new ideas for what to add to SBFv2:

  • BTF for type tracking at load time. Can potentially be used for skipping memory access checks at runtime.
  • Restricting jumps to their local function.
  • Removing le16, le32 and le64 (see Consider renaming le16, le32, le64 rbpf#354)
  • Removing sub imm or swapping the register and immediate operands roles.

@Lichtso
Contributor

Lichtso commented Jul 17, 2023

In the last two weeks I have spent some time refactoring, cleaning up, and testing SBPFv2 in RBPF. During that work I discovered multiple issues that we should consider addressing.
I think points 1, 3 and 4 should definitely be implemented before releasing SBPFv2 in program-runtime-v1. The rest is only required for program-runtime-v2, but IMO it would still make sense to add it now so that we get more testing exposure.

Proposed Solution

  1. In order to speed up and harden the ELF parsing, we would ignore relocations and instead apply all of them in lld beforehand. Relocations may still be emitted for debugging purposes or future use. The issue with relocations is that they can modify any byte in the ELF, including the headers and other sections, and thus influence the rest of the parsing process.
    Currently, if the output type is set to "shared", lld outright refuses to --apply-dynamic-relocs and only supports --emit-relocs. This needs to be fixed.

  2. In order to unify the target addressing scheme of call and callx and allow both to address internal and external functions,
    we would switch from relative addressing to hash keys for internal calls too, just as we did for external calls (see static syscalls).

  3. In llvm, the code emitter's encoding of call instructions would be changed:

  • from src=1 to offset=0 for internal function calls
  • from src=0 to offset>0 for external function calls

That way the offset and imm fields together form a 48-bit function address:

  • The 16 MSBs (from offset) mark the program the address belongs to: 0 is self, and anything higher is the index into the list of DT_NEEDED entries plus one.
  • The 32 LSBs (from imm) are the hash key inside that program.
  4. Previously, the function registry was the unified set of addresses that were relative call targets and relocation targets. With both of these gone, we would instead build the function registry directly from the dynamic symbol table. For this, lld should build the dynamic symbol table from local and global functions and nothing else.

  5. In order to avoid having to hash all symbol names when parsing the ELFs, we would instead encode the hashed value in binary directly in the name field of the symbols. This would only apply to the dynamic symbol table; the normal symbol table would still use string names for debugging.

  6. In order to link against external programs, lld would emit a list of DT_NEEDED entries with base58-encoded pubkeys (instead of file names).
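The 16+32-bit address layout from point 3 could be packed and unpacked like this (illustrative Python helpers, not actual loader code):

```python
def pack_function_address(offset: int, imm: int) -> int:
    """Pack a function address from a call instruction's fields:
    offset (16 bits) selects the program (0 = self, n = DT_NEEDED index + 1),
    imm (32 bits) is the hash key within that program."""
    assert 0 <= offset < (1 << 16) and 0 <= imm < (1 << 32)
    return (offset << 32) | imm

def unpack_function_address(addr: int):
    """Return (program selector, hash key)."""
    return addr >> 32, addr & 0xFFFF_FFFF

# Program selector 3 = DT_NEEDED index 2; 0xDEAD_BEEF is a made-up hash key.
addr = pack_function_address(3, 0xDEAD_BEEF)
print(hex(addr), unpack_function_address(addr))
# → 0x3deadbeef (3, 3735928559)
```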

@ripatel-fd
Contributor

ripatel-fd commented Jul 17, 2023

@Lichtso All of the above sounds great if we stick with ELF. I had a few thoughts as well:

we would switch from relative addressing to hash keys for internal calls too

This forces a sequential dependency on a memory load for every function call (to look up hash => destination). Since function calls are quite common, this may worsen performance and thrash caches.

Instead, I would go the other direction:

  • call: Internal function call, relative addressing via immediate
  • callx: Internal indirect function call, relative addressing via register
  • farcall (new opcode): External function call, hash addressing via immediate

we would instead encode the hashed value in the name field of the symbols in binary directly

Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?

DT_NEEDED entries with base58 encoded pubkeys

Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.

Let's use this opportunity to specify strict validation rules for the number of entries in the .dynamic section and the order of those entries (unless this is unreasonably complicated). There should probably be some versioning mechanism for how external imports get resolved. We could use PT_INTERP to point to a dummy loader such as /lib/ld-solana-sbfv2.so.2.0.0.

Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.

@Lichtso
Contributor

Lichtso commented Jul 17, 2023

This forces a sequential dependency on a memory load for every function call (to look up hash=>destination). Since function calls are quite common, this may worsen performance and trash caches.

Good point, at least for interpreters I guess.

Instead, I would go the other direction

We already distinguish between what you refer to as call and farcall via the src field of the call instruction. What we need instead is a way for dynamic calls (the callx instruction) to target external functions. How about the following slight modification, which would still unify the call and callx instructions and avoid an additional lookup for internal function calls:

There is a 48-bit unified address space for all functions:

  • internal calls: 16 MSBs are zero, 32 LSBs are the absolute IP / PC

  • external calls: 16 MSBs are the DT_NEEDED index plus one, 32 LSBs are the hash key

  • call instruction encoding: 16 MSBs of the unified address in offset, 32 LSBs of the unified address in imm

  • callx register content: 16 fixed-zero MSBs followed by the 48 LSBs of the unified address

Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?

We could either encode the u32 key directly where the index into the string table would be, or hex-ASCII encode it as 9 characters (including NUL termination) in the string table. The latter would be slightly less efficient because of the indirection, but would keep compatibility with other tooling such as disassemblers, readelf, etc.
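The second (hex-ASCII) option could look like this as a sketch; encode_key/decode_key are hypothetical helper names:

```python
def encode_key(key: int) -> bytes:
    """Store a u32 hash key as 8 hex characters plus NUL termination
    (9 bytes total) so generic tooling still sees a valid C string."""
    assert 0 <= key < (1 << 32)
    return f"{key:08x}".encode("ascii") + b"\x00"

def decode_key(name: bytes) -> int:
    assert len(name) == 9 and name.endswith(b"\x00")
    return int(name[:-1], 16)

encoded = encode_key(0xDEAD_BEEF)
print(encoded, decode_key(encoded))
# → b'deadbeef\x00' 3735928559
```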

Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.

Hex is fine too, base58 is just a bit nicer for debugging but probably not worth it.

Let's use this opportunity to specify strict validation rules.
Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.

I think the consensus within our team at the moment is that we stick with ELF to keep the tooling reach, but restrict it as much as possible for ease of parsing.

@ripatel-fd
Contributor

@Lichtso Mocking a flat address space is an interesting solution. I wasn't aware that code might need to indirectly call into another object, so that makes sense. With that in mind, I agree with having both call and callx opcodes handle "near" and "far" calls, and I prefer this solution over a mix of hashes and relative addresses.

external calls: 16 MSBs are the DT_NEEDED index plus one, 32 LSBs are the hash key

Using a hash as the lower bits of an address seems a bit exotic ... But not all that new in eBPF land, where low addresses map to kernel function calls.

The remaining issue is the use of binary/opaque identifiers where ELF only supports strings. Ideally, we would find a solution that works the same for symbol hashes and program addresses. Seems like we could either (1) specify a string encoding and use the ELF standard structures, or (2) roll our own custom dynamic imports table and call destination table. I begrudgingly prefer 1.

@Lichtso
Contributor

Lichtso commented Jul 18, 2023

@alessandrod pointed out to me that we don't have to hack the symbol hashes into the symbol table or symbol strings directly. Instead, we can use the DT_GNU_HASH section to define hash values for the symbols in the ELF with almost no overhead. He also recommended the paper "How to Write Shared Libraries" by Ulrich Drepper on the topic of linking and shared objects in ELF.

Also, instead of using hash keys in the 32 LSBs of a function pointer, we could use the index into the symbol table. This has the tremendous advantage that, without hashing, there can't be any collisions, which would otherwise be a problem in function pointers at runtime. An implementation could either look up the target IP/PC in the symbol table of the selected dependency, or allocate and initialize a PLT (procedure linkage table) at the beginning of a transaction and do lazy loading (performing the lookup once and then patching the PLT, so that the next use of the same entry becomes a single indirect call).
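The lazy-loading variant could be sketched as follows (a toy Python model, not an actual implementation): the first call through a PLT slot resolves the symbol-table index to a target IP/PC and patches the slot; subsequent calls hit the cached entry.

```python
class Plt:
    """Toy model of lazy PLT-style resolution for external calls."""

    def __init__(self, dep_symbols):
        self.dep_symbols = dep_symbols  # symbol-table index -> target IP/PC
        self.slots = {}                 # patched (resolved) entries
        self.lookups = 0                # counts slow-path resolutions

    def call(self, sym_index):
        if sym_index not in self.slots:  # slow path: resolve once and patch
            self.lookups += 1
            self.slots[sym_index] = self.dep_symbols[sym_index]
        return self.slots[sym_index]     # fast path on every later call

plt = Plt({0: 0x120, 1: 0x240})
assert plt.call(1) == 0x240
assert plt.call(1) == 0x240  # second call uses the patched slot
assert plt.lookups == 1      # the symbol table was consulted only once
```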

This approach would require that the order of exported symbols in a program stays stable across recompilations and redeployments. During redeployment we have to compare the replacement of a program against its previous version anyway, in order to make sure the existing function signatures stay the same; that step could also be used to reorder the symbol table accordingly.

Projects
No open projects
SBF
In progress