-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wazevo(arm64): trampoline for far calls on relocation #2169
Conversation
encodeMoveWideImmediate(movzOp, tmpReg, uint64(uint16(addr)), 0, 1), | ||
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>16)), 1, 1), | ||
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>32)), 2, 1), | ||
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>48)), 3, 1), | ||
encodeUnconditionalBranchReg(tmpReg, true), | ||
encodeUnconditionalBranch(false, returnOffset-6*4), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible that we can resolve the width of the jump earlier and either generate a smaller trampoline, with less instructions, or use a different strategy for encoding the address (for now I just wanted to see if the general approach was working)
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the direction looks good, but did you do some research on how other linkers/comilers deal with this?
Yes, I will add some notes inline in the code. Here's my summary. On ARM there are 3 types of "veneer" (trampoline/thunk in ARM parlance)
The main issue is where to put them. My first thought was to tack it at the end of the current function for simplicity, but how do linkers do it? Because the job of a traditional linker is to shuffle code around, it's cheap for them to generate new code and then point the branch to it; essentially, they can generate a new (internal) symbol. This online book ("Embedded Systems Security and TrustZone") gives a nice explanation of how this works. But in our case adding a new symbol/code section would mean adding a new Placing these new segments now represents a problem of its own; for instance:
In fact, these sections can be fused with the code of a function: this behavior can be controlled through linker scripts; for instance, the veneer for interlinking is joined with the So, tacking it to the bottom of the original function is actually a behavior that can be controlled even in traditional linkers, and, in our case, it addresses the other problems; assuming that the bottom of the function is within range for a relative jump. I currently ruled out encoding the trampoline as a short-range thunk for simplicity: it's an optimization we can address later, and it was important to get to the bottom of this first. Another optimization might be detecting multiple invocations to the same function and jump to the same veneer. Finally, the text itself of the veneer: how do you load the address? There are a few possible approaches; in my case, again, for simplicity, I am |
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
I improved the PoC but it's not clean enough for reviewing, so I'll push the code later, but I'll write some notes here. I wanted to make sure that we do not waste too much space for trampolines that are not needed. In the code here we always allocate the space for a trampoline for each relocation, regardless if this trampoline will be needed (i.e. when the relocation does not require a far jump). Now, suppose that we have functions The solution is obviously to iterate more than once; but how many times? We could compute a fixed point, but instead:
|
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Pushed change following #2169 (comment); actually, slightly improved because I realized I do not need to do 3 separate passes, but only 2 are really necessary.
We don't even need to actually allocate the space ( This still needs proper tests. |
@@ -220,6 +220,7 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene | |||
totalSize := 0 // Total binary size of the executable. | |||
cm.functionOffsets = make([]int, localFns) | |||
bodies := make([][]byte, localFns) | |||
sourceOffsets := make([][]backend.SourceOffsetInfo, localFns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to append the sourceOffsets in the second stage, so we need to store be.SourceOffsetInfo()
somewhere. I don't love this change, but I could not think of anything better at the moment...
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Added a few test cases, will continue later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the right track I guess!
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
would you mind comparing the binary size and the compilation time perf before/after for stdlib binaries (Zig, Go, and libsodium would be fine)? |
I double-checked file sizes and, for all these benchmarks, they do not change (as expected), because all of these test cases are small. In short, the difference is small enough, in file size and negligible in time (notice that Original reported case
compiled size:
Other BenchmarksI have omitted file sizes here because they do not change. Zig. No file size impact, essentially no run-time impact.
libsodium. No file size impact, about 0.23% run-time impact.
Wasip1. No file size impact, essentially no run-time impact (0.02%).
|
r.Offset += int64(totalSize) | ||
e.rels = append(e.rels, r) | ||
} | ||
|
||
bodies[i] = body | ||
totalSize += len(body) | ||
totalSize += len(body) + e.machine.RelocationTrampolineSize(rels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic is broken here. Inside RelocationTrampolineSize, it checks rels[i].TrampolineOffset > 0
but TrampolineOffset is always zero at this point. In fact
--- a/internal/engine/wazevo/engine.go
+++ b/internal/engine/wazevo/engine.go
@@ -263,7 +263,11 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene
}
bodies[i] = body
- totalSize += len(body) + e.machine.RelocationTrampolineSize(rels)
+ s := e.machine.RelocationTrampolineSize(rels)
+ if s != 0 {
+ panic("MUST HIT")
+ }
+ totalSize += len(body) + s
if wazevoapi.PrintMachineCodeHexPerFunction {
fmt.Printf("[[[machine code for %s]]]\n%s\n\n", wazevoapi.GetCurrentFunctionName(ctx), hex.EncodeToString(body))
}
this obvious assertion doesn't get hit after running wazero run pg.wasm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indicates that the current code is not starting with the worst case size, but instead "smallest size" scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly me i must have introduced that bug in e.machine.RelocationTrampolineSize(rels) while writing and cleaning up the tests. Earlier it just multiplied by the size of rels
ok I think I will take this over. Thank you for your hard work and hints here @evacchi |
superseded by #2181 |
Fixes #2158, follows up to #2167.
Draft for feedback on the approach; also needs tests.
r.TrampolineOffset
) at the end of a function to accommodate a trampoline for far jumps (using BRL).tmp
reg -- movz/movk/movk/movktmp
resolveRelativeAddress
) we use relative jumps but at that time 1) we don't know the absolute address yet and 2) we can inject instruction at any point within the function because we are resolving the addresses before encoding.Moreover, BRL's 64-bit addressing should obviously be comfortable enough to reach any point in the executable.
This was manually tested against the reproducers (top post and others in the comments) in #2158 and verified to work, but I will add proper tests.
Signed-off-by: Edoardo Vacchi evacchi@users.noreply.github.com