wazevo(arm64): trampoline for far calls on relocation #2169

evacchi · 2024-04-02T20:20:48Z

Fixes #2158, follows up to #2167.

Draft for feedback on the approach; also needs tests.

Essentially, this adds some padding (r.TrampolineOffset) at the end of a function to accommodate a trampoline for far jumps (using BLR).
The trampoline is written only if the relative jump is outside the accepted range.
When such an out-of-range relative jump is detected, instead of jumping to the function we jump to the trampoline, which consists of:
- a 64-bit mov to the tmp reg (movz/movk/movk/movk)
- BLR to tmp
- an unconditional branch back to the origin
Note: this currently adds the padding for all relocated calls, so it is not optimal. I will continue working on it before undrafting; I think it should be doable with some minor refactoring. The only cost is wasted space, though: we use the trampoline only for far jumps.
Also note: this uses BLR. In other places (e.g. resolveRelativeAddresses) we use relative jumps, but at that point 1) we don't know the absolute address yet and 2) we can inject instructions at any point within the function, because we are resolving the addresses before encoding.

Moreover, BLR's 64-bit addressing should be comfortable enough to reach any point in the executable.
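
To make the "accepted range" concrete, here is a minimal sketch of the range check, assuming the usual AArch64 B/BL encoding (a signed 26-bit word offset, i.e. roughly +/-128MB). The helper name needsTrampoline is hypothetical and not taken from the PR:

```go
package main

import "fmt"

// Assumption: an AArch64 B/BL instruction encodes a signed 26-bit word
// offset, so the reachable byte displacement is [-2^27, 2^27-4].
const (
	minBranchOffset = -(1 << 27)
	maxBranchOffset = 1<<27 - 4
)

// needsTrampoline is a hypothetical helper: it reports whether a relative
// call with the given byte displacement cannot be encoded directly.
func needsTrampoline(diff int64) bool {
	return diff < minBranchOffset || diff > maxBranchOffset
}

func main() {
	for _, d := range []int64{0x100, -0x8000000, -0x8000388} {
		fmt.Printf("diff=%#x needsTrampoline=%v\n", d, needsTrampoline(d))
	}
}
```

Only displacements that fail this check would be redirected to the trampoline; everything else keeps the plain relative call.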

This was manually tested against the reproducers (top post and others in the comments) in #2158 and verified to work, but I will add proper tests.

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi · 2024-04-02T20:22:10Z

internal/engine/wazevo/backend/isa/arm64/machine_relocation.go

+		encodeMoveWideImmediate(movzOp, tmpReg, uint64(uint16(addr)), 0, 1),
+		encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>16)), 1, 1),
+		encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>32)), 2, 1),
+		encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>48)), 3, 1),
+		encodeUnconditionalBranchReg(tmpReg, true),
+		encodeUnconditionalBranch(false, returnOffset-6*4),


It's possible that we can resolve the width of the jump earlier and either generate a smaller trampoline with fewer instructions, or use a different strategy for encoding the address. For now I just wanted to see whether the general approach works.
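
One way the "smaller trampoline" idea could look, as a sketch only: skip the 16-bit chunks of the address that are zero, so that fewer move-wide instructions are emitted. The movInstr type and loadAddr helper are hypothetical and only model the sequence; they do not use the PR's encodeMoveWideImmediate:

```go
package main

import "fmt"

// movInstr is a hypothetical, simplified description of one move-wide
// instruction: mnemonic, 16-bit immediate, and shift in units of 16 bits.
type movInstr struct {
	op    string
	imm   uint16
	shift int
}

// loadAddr returns a movz/movk sequence that materializes addr in a
// register, skipping the 16-bit chunks that are zero.
func loadAddr(addr uint64) []movInstr {
	var seq []movInstr
	for shift := 0; shift < 4; shift++ {
		imm := uint16(addr >> (16 * shift))
		if imm == 0 {
			continue
		}
		op := "movk" // keeps the other bits of the register
		if len(seq) == 0 {
			op = "movz" // first instruction zeroes the rest of the register
		}
		seq = append(seq, movInstr{op, imm, shift})
	}
	if len(seq) == 0 {
		// addr == 0 still needs one instruction to clear the register.
		seq = append(seq, movInstr{"movz", 0, 0})
	}
	return seq
}

func main() {
	for _, in := range loadAddr(0x100002000) {
		fmt.Printf("%s tmp, #%#x, lsl #%d\n", in.op, in.imm, 16*in.shift)
	}
}
```

The catch is that the trampoline size then depends on the target address, which is only known at relocation time, so any space reserved earlier would still have to assume the full four-instruction form.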

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

internal/engine/wazevo/backend/isa/arm64/machine_relocation.go

mathetake

the direction looks good, but did you do some research on how other linkers/compilers deal with this?

internal/engine/wazevo/engine.go

evacchi · 2024-04-03T09:02:14Z

> did you do some research on how other linkers/compilers deal with this?

Yes, I will add some notes inline in the code. Here's my summary.

On ARM there are 3 types of "veneer" (ARM parlance for a trampoline/thunk):

- inline, which is ARM-specific: a combination of an absolute jump and a switch between instruction sets (Arm to Thumb, i.e. A32->T32 etc.) using the BX instruction; this is also called interworking
- short-range (still a relative jump, essentially like those we generate e.g. in resolveRelativeAddresses)
- long-range (absolute, like this one)

The main issue is where to put them. My first thought was to tack it at the end of the current function for simplicity, but how do linkers do it?

Because the job of a traditional linker is to shuffle code around, it's cheap for them to generate new code and then point the branch to it; essentially, they can generate a new (internal) symbol. This online book ("Embedded Systems Security and TrustZone") gives a nice explanation of how this works. But in our case adding a new symbol/code section would mean adding a new functionOffset and mmapping a new segment.

Placing these new segments is a problem of its own; for instance:

- We could add these new segments as we need them, incrementing the i in cm.functionOffsets[i]; e.g. say we are processing function i and we discover we need k trampolines, then we would generate those k sections and add them at i+1, ..., i+k. But then we would lose the correspondence between cm.functionOffsets[i] and module.CodeSection[i] (as well as other similar cases), and that would cascade to the surrounding code.
- Instead, we could add them to the end of functionOffsets, with j=len(functionOffsets), .... This however means that, depending on the order in which we generate and mmap them, they could end up even farther away than the original function that needed to be invoked.
- Besides, this also adds more mmapped code to be collected and finalized.

In fact, these sections can be fused with the code of a function: this behavior can be controlled through linker scripts; for instance, the veneer for interworking is joined with the .text section by the default linker script.

So, tacking it to the bottom of the original function is actually a behavior that can be controlled even in traditional linkers, and, in our case, it addresses the other problems; assuming that the bottom of the function is within range for a relative jump.

I currently ruled out encoding the trampoline as a short-range thunk for simplicity: it's an optimization we can address later, and it was important to get to the bottom of this first.

Another optimization might be detecting multiple invocations of the same function and jumping to the same veneer.

Finally, the text itself of the veneer: how do you load the address? There are a few possible approaches; in my case, again, for simplicity, I am movz/movk/movk/movk'ing the address to the tmp reg which we know is safe to use. Another way to load the address is similar to brTableSequence, using adr and a constant pool. In some cases (when the address is especially large), adrp+add might need to be used instead (adrp addresses by 4KB-pages). E.g. the example in the book uses ldr r0, [pc, #8] which is essentially what adr r0, #8 would translate to.

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi · 2024-04-03T20:26:48Z

I improved the PoC but it's not clean enough for reviewing, so I'll push the code later, but I'll write some notes here.

I wanted to make sure that we do not waste too much space on trampolines that are not needed. In the code here we always allocate the space for a trampoline for each relocation, regardless of whether the trampoline will be needed (i.e. even when the relocation does not require a far jump).

Now, suppose that we have functions fk, fk1, fk2 all adjacent to each other. Suppose that fk needs more space to accommodate a trampoline to fk2. If we do a linear scan, at this point, we can't know for certain the offset of fk2 because we don't know if also fk1 will require some trampolines (causing it to increase in size).

The solution is obviously to iterate more than once; but how many times? We could compute a fixed point, but instead:

1. we can do a first pass assuming the "worst case" scenario, where each relocation really needs a trampoline;
   - this way, we can precompute the size of each function with all the trampolines (and thus their offsets) knowing that they can't grow further than that
2. we iterate over the relocations and flag those that really need a trampoline:
   - now, for each function, if any of the offset diffs is within range, we know for certain that that trampoline won't be needed
   - conversely, if any of the offset diffs is outside the acceptable range, we will just assume that a trampoline will be needed to avoid further iterations (we can further optimize the edge cases in the future)
3. we do a final pass where we actually allocate space at the end of each function and write the trampolines
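
The first two passes can be sketched roughly as follows. This is a simplified model, not the PR's code: the reloc type, the 24-byte trampoline size (6 instructions of 4 bytes) and the +/-128MB range are assumptions made for illustration:

```go
package main

import "fmt"

const (
	trampolineSize = 24      // assumption: 6 instructions * 4 bytes
	maxRange       = 1 << 27 // assumption: +/-128MB branch range
)

// reloc is a hypothetical, simplified relocation record.
type reloc struct {
	caller, callee int  // function indexes
	offset         int  // offset of the call instruction within the caller body
	far            bool // set by flagFar
}

// worstCaseOffsets lays the functions out as if every relocation needed
// a trampoline, so no later decision can push a function further away.
func worstCaseOffsets(bodySizes []int, relocs []reloc) []int {
	perFn := make([]int, len(bodySizes))
	for _, r := range relocs {
		perFn[r.caller]++
	}
	offsets := make([]int, len(bodySizes))
	total := 0
	for i, sz := range bodySizes {
		offsets[i] = total
		total += sz + perFn[i]*trampolineSize
	}
	return offsets
}

// flagFar marks the relocations whose worst-case displacement is out of
// range; the others are in range for any smaller layout as well.
func flagFar(offsets []int, relocs []reloc) {
	for i := range relocs {
		r := relocs[i]
		diff := offsets[r.callee] - (offsets[r.caller] + r.offset)
		relocs[i].far = diff < -maxRange || diff >= maxRange
	}
}

func main() {
	bodySizes := []int{100, 1 << 27, 100}
	relocs := []reloc{{caller: 0, callee: 1}, {caller: 0, callee: 2}}
	offsets := worstCaseOffsets(bodySizes, relocs)
	flagFar(offsets, relocs)
	for _, r := range relocs {
		fmt.Printf("f%d -> f%d far=%v\n", r.caller, r.callee, r.far)
	}
}
```

The reason a single flagging pass is safe is that removing unneeded trampolines only shrinks the distance between a call site and its target, so a relocation that is in range in the worst-case layout stays in range in the final one.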

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi · 2024-04-04T12:14:31Z

Pushed a change following #2169 (comment); it is actually slightly improved, because I realized that only 2 passes are really necessary rather than 3:

1. First, precompute all the offsets, including "maximal" sizes for trampolines.
2. Scan forward, recompute the diffs, and account for the trampolines that are needed.

We don't even need to actually allocate the space (body = append(body, make([]byte, N)...)), because when we mmap and copy over the bodies, they are laid out at the given offset, which will account for trampolines. When we write the trampolines later we overwrite all the uninitialized bytes, so everything works.
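
To illustrate why no explicit padding is needed, here is a small sketch in which a plain byte slice stands in for the mmapped executable. The layOut function and the toy sizes are hypothetical:

```go
package main

import "fmt"

// layOut copies each body into one buffer at its precomputed offset.
// The gap between the end of a body and the next offset is where the
// trampolines for that function are written later.
func layOut(bodies [][]byte, offsets []int, total int) []byte {
	exec := make([]byte, total) // stands in for the mmapped executable
	for i, b := range bodies {
		copy(exec[offsets[i]:], b)
	}
	return exec
}

func main() {
	bodies := [][]byte{{0x01, 0x02}, {0x03}}
	offsets := []int{0, 6} // 4 bytes reserved after the first body
	exec := layOut(bodies, offsets, 7)

	// Later: write a (fake) 4-byte trampoline into the reserved gap.
	copy(exec[2:6], []byte{0xaa, 0xbb, 0xcc, 0xdd})
	fmt.Printf("% x\n", exec)
}
```

The bodies themselves never grow; only the offsets account for the trampoline area.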

This still needs proper tests.

internal/engine/wazevo/backend/machine.go

evacchi · 2024-04-04T12:22:18Z

internal/engine/wazevo/engine.go

@@ -220,6 +220,7 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene
 	totalSize := 0 // Total binary size of the executable.
 	cm.functionOffsets = make([]int, localFns)
 	bodies := make([][]byte, localFns)
+	sourceOffsets := make([][]backend.SourceOffsetInfo, localFns)


We need to append the sourceOffsets in the second stage, so we need to store be.SourceOffsetInfo() somewhere. I don't love this change, but I could not think of anything better at the moment.

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi · 2024-04-04T15:56:00Z

Added a few test cases, will continue later.

mathetake

on the right track I guess!

internal/engine/wazevo/backend/isa/arm64/machine_relocation.go

internal/engine/wazevo/backend/machine.go

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

mathetake · 2024-04-06T01:53:26Z

would you mind comparing the binary size and the compilation time perf before/after for stdlib binaries (Zig, Go, and libsodium would be fine)?

evacchi · 2024-04-08T08:31:26Z

I double-checked file sizes and, for all these benchmarks, they do not change (as expected), because all of these test cases are small.
During my tests, in fact, I always double-checked with an (arguably non-representative) test case (examples/cli) that the file size didn't grow, as well as with the originally reported case (and other variants).

In short, the difference is small in file size and negligible in time (notice that wazero compile exits early in the first case, because of the panic).

Original reported case

❯ time wazero compile pg.wasm
264214980
panic: TODO: too large binary where branch target is out of the supported range +/-128MB: -0x8000388
[...]
wazero compile pg.wasm  29.16s user 0.17s system 100% cpu 29.323 total

❯ time wazero compile pg.wasm
266556404
wazero compile pg.wasm  29.37s user 0.17s system 99% cpu 29.794 total

compiled size:

266,556,404 - 264,214,980 = 2,341,424 (+0.886%)

Other Benchmarks

I have omitted file sizes here because they do not change.

Zig. No file size impact, essentially no run-time impact.

goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/stdlibs
                             │ zig-before.txt │          zig-after.txt           │
                             │     sec/op     │   sec/op    vs base              │
Zig/Compile/test-opt.wasm-10       4.704 ± 1%   4.687 ± 1%       ~ (p=0.132 n=6)
Zig/Compile/test.wasm-10           5.835 ± 0%   5.857 ± 1%       ~ (p=0.065 n=6)
geomean                            5.239        5.239       +0.00%

libsodium. No file size impact, about 0.23% run-time impact.

goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/libsodium
                                            │ libsodium-before.txt │          libsodium-after.txt           │
                                            │        sec/op        │      sec/op       vs base              │
Libsodium/box_easy2-10                            0.0007129n ±  5%   0.0006772n ±  9%       ~ (p=0.132 n=6)
Libsodium/kdf_hkdf-10                             0.0003953n ±  4%   0.0003964n ±  3%       ~ (p=0.937 n=6)
Libsodium/auth5-10                                0.0003127n ±  7%   0.0003168n ± 11%       ~ (p=1.000 n=6)
Libsodium/stream2-10                              0.0003393n ±  5%   0.0003427n ±  8%       ~ (p=0.394 n=6)
Libsodium/aead_xchacha20poly1305-10               0.0003348n ±  4%   0.0003343n ±  2%       ~ (p=0.788 n=6)
Libsodium/hash3-10                                0.0002804n ±  5%   0.0002856n ±  4%       ~ (p=0.485 n=6)
Libsodium/aead_chacha20poly1305-10                0.0003540n ± 16%   0.0003385n ±  4%       ~ (p=0.093 n=6)
Libsodium/auth-10                                 0.0003556n ±  6%   0.0003603n ±  2%       ~ (p=0.180 n=6)
Libsodium/onetimeauth-10                          0.0002945n ±  7%   0.0002922n ±  6%       ~ (p=0.937 n=6)
Libsodium/aead_aegis256-10                        0.0004372n ±  4%   0.0004215n ±  6%       ~ (p=0.093 n=6)
Libsodium/scalarmult_ristretto255-10              0.0007270n ±  4%   0.0007688n ±  4%  +5.74% (p=0.009 n=6)
Libsodium/stream3-10                              0.0002701n ±  7%   0.0002811n ± 15%       ~ (p=0.240 n=6)
Libsodium/shorthash-10                            0.0002634n ± 12%   0.0002622n ±  3%       ~ (p=0.310 n=6)
Libsodium/scalarmult-10                           0.0005916n ±  4%   0.0005988n ±  4%       ~ (p=0.221 n=6)
Libsodium/chacha20-10                             0.0003030n ±  7%   0.0003136n ±  3%       ~ (p=0.240 n=6)
Libsodium/onetimeauth7-10                         0.0002718n ±  3%   0.0002714n ±  6%       ~ (p=0.853 n=6)
Libsodium/scalarmult7-10                          0.0005432n ±  2%   0.0005475n ±  1%       ~ (p=0.240 n=6)
Libsodium/auth3-10                                0.0003165n ±  5%   0.0002981n ±  3%  -5.84% (p=0.009 n=6)
Libsodium/stream4-10                              0.0002684n ±  6%   0.0002789n ±  3%       ~ (p=0.093 n=6)
Libsodium/hash-10                                 0.0003428n ±  5%   0.0003498n ±  3%       ~ (p=0.093 n=6)
Libsodium/auth2-10                                0.0003006n ±  7%   0.0003041n ±  3%       ~ (p=0.818 n=6)
Libsodium/scalarmult6-10                          0.0005946n ±  5%   0.0005847n ±  6%       ~ (p=0.180 n=6)
Libsodium/ed25519_convert-10                      0.0008720n ±  1%   0.0009367n ±  4%  +7.43% (p=0.002 n=6)
Libsodium/box_seal-10                             0.0007852n ±  3%   0.0007823n ±  5%       ~ (p=0.818 n=6)
Libsodium/secretbox7-10                           0.0003158n ±  2%   0.0003074n ±  4%       ~ (p=0.180 n=6)
Libsodium/pwhash_argon2i-10                       0.0005476n ±  6%   0.0005491n ±  4%       ~ (p=1.000 n=6)
Libsodium/secretstream_xchacha20poly1305-10       0.0003824n ±  4%   0.0003855n ±  1%       ~ (p=0.485 n=6)
Libsodium/codecs-10                               0.0003263n ±  3%   0.0003268n ±  6%       ~ (p=1.000 n=6)
Libsodium/scalarmult_ed25519-10                   0.0008271n ±  1%   0.0008541n ±  2%  +3.26% (p=0.004 n=6)
Libsodium/sodium_utils-10                         0.0003061n ±  2%   0.0003191n ±  5%       ~ (p=0.132 n=6)
Libsodium/scalarmult5-10                          0.0005680n ±  3%   0.0005488n ±  2%  -3.38% (p=0.002 n=6)
Libsodium/xchacha20-10                            0.0007805n ±  3%   0.0007700n ±  4%       ~ (p=0.818 n=6)
Libsodium/secretbox8-10                           0.0003150n ±  6%   0.0003105n ±  8%       ~ (p=0.589 n=6)
Libsodium/box2-10                                 0.0006106n ±  4%   0.0006109n ±  7%       ~ (p=0.818 n=6)
Libsodium/core3-10                                0.0003376n ±  2%   0.0003367n ±  2%       ~ (p=0.818 n=6)
Libsodium/siphashx24-10                           0.0002583n ±  6%   0.0002516n ±  5%       ~ (p=0.485 n=6)
Libsodium/generichash-10                          0.0004203n ±  2%   0.0004451n ±  5%  +5.90% (p=0.026 n=6)
Libsodium/aead_chacha20poly13052-10               0.0003462n ±  5%   0.0003464n ±  6%       ~ (p=0.818 n=6)
Libsodium/randombytes-10                          0.0002721n ±  3%   0.0002599n ±  2%  -4.50% (p=0.004 n=6)
Libsodium/scalarmult8-10                          0.0005683n ±  2%   0.0005628n ±  6%       ~ (p=0.818 n=6)
Libsodium/kx-10                                   0.0007032n ± 12%   0.0006739n ±  1%  -4.17% (p=0.009 n=6)
Libsodium/stream-10                               0.0003523n ±  3%   0.0003618n ±  3%       ~ (p=0.132 n=6)
Libsodium/auth7-10                                0.0003017n ±  8%   0.0003043n ±  4%       ~ (p=0.699 n=6)
Libsodium/generichash2-10                         0.0003576n ±  5%   0.0003523n ±  1%       ~ (p=0.132 n=6)
Libsodium/box_seed-10                             0.0006024n ±  4%   0.0005815n ±  1%  -3.47% (p=0.002 n=6)
Libsodium/keygen-10                               0.0002788n ±  3%   0.0002735n ±  5%       ~ (p=1.000 n=6)
Libsodium/metamorphic-10                          0.0005393n ±  4%   0.0005164n ±  3%  -4.25% (p=0.004 n=6)
Libsodium/secretbox_easy2-10                      0.0003556n ±  2%   0.0003679n ±  3%  +3.47% (p=0.026 n=6)
Libsodium/sign2-10                                0.0008472n ±  1%   0.0008428n ±  1%       ~ (p=0.394 n=6)
Libsodium/box_easy-10                             0.0006116n ±  5%   0.0006247n ±  5%       ~ (p=0.394 n=6)
Libsodium/secretbox2-10                           0.0003079n ±  2%   0.0003050n ±  6%       ~ (p=0.667 n=6)
Libsodium/box-10                                  0.0006185n ±  6%   0.0006212n ±  5%       ~ (p=0.699 n=6)
Libsodium/kdf-10                                  0.0003596n ±  4%   0.0003609n ± 16%       ~ (p=1.000 n=6)
Libsodium/secretbox_easy-10                       0.0003408n ±  9%   0.0003423n ±  2%       ~ (p=1.000 n=6)
Libsodium/onetimeauth2-10                         0.0002460n ±  5%   0.0002517n ±  2%       ~ (p=0.240 n=6)
Libsodium/generichash3-10                         0.0003480n ±  3%   0.0003482n ±  3%       ~ (p=0.937 n=6)
Libsodium/scalarmult2-10                          0.0005454n ±  2%   0.0005735n ±  4%  +5.17% (p=0.002 n=6)
Libsodium/aead_aegis128l-10                       0.0004260n ±  5%   0.0004270n ± 18%       ~ (p=0.589 n=6)
Libsodium/auth6-10                                0.0002917n ±  3%   0.0002952n ± 12%       ~ (p=0.240 n=6)
Libsodium/secretbox-10                            0.0003171n ±  4%   0.0003254n ±  5%       ~ (p=0.180 n=6)
Libsodium/verify1-10                              0.0002948n ±  3%   0.0002955n ±  4%       ~ (p=0.818 n=6)
geomean                                           0.0004051n         0.0004060n        +0.23%

Wasip1. No file size impact, essentially no run-time impact (0.02%).

goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/stdlibs
                                                   │ wasip1-before.txt │         wasip1-after.txt         │
                                                   │      sec/op       │   sec/op    vs base              │
Wasip1/Compile/src_archive_tar.test-10                      3.195 ± 1%   3.214 ± 1%       ~ (p=0.240 n=6)
Wasip1/Compile/src_bufio.test-10                            1.936 ± 0%   1.941 ± 1%  +0.26% (p=0.015 n=6)
Wasip1/Compile/src_bytes.test-10                            1.999 ± 0%   2.008 ± 1%  +0.49% (p=0.009 n=6)
Wasip1/Compile/src_context.test-10                          2.142 ± 0%   2.150 ± 1%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_ascii85.test-10                 1.748 ± 1%   1.754 ± 0%       ~ (p=0.065 n=6)
Wasip1/Compile/src_encoding_asn1.test-10                    1.987 ± 0%   1.994 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_base32.test-10                  1.826 ± 1%   1.830 ± 1%       ~ (p=0.240 n=6)
Wasip1/Compile/src_encoding_base64.test-10                  1.840 ± 0%   1.854 ± 1%  +0.74% (p=0.004 n=6)
Wasip1/Compile/src_encoding_binary.test-10                  1.877 ± 1%   1.883 ± 1%  +0.33% (p=0.041 n=6)
Wasip1/Compile/src_encoding_csv.test-10                     1.838 ± 0%   1.843 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_gob.test-10                     2.428 ± 0%   2.432 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_encoding_hex.test-10                     1.777 ± 0%   1.782 ± 1%       ~ (p=0.485 n=6)
Wasip1/Compile/src_encoding_json.test-10                    5.226 ± 1%   5.228 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_encoding_pem.test-10                     2.298 ± 1%   2.302 ± 0%       ~ (p=0.240 n=6)
Wasip1/Compile/src_encoding_xml.test-10                     2.130 ± 0%   2.131 ± 0%       ~ (p=0.699 n=6)
Wasip1/Compile/src_errors.test-10                           1.821 ± 1%   1.825 ± 0%       ~ (p=0.485 n=6)
Wasip1/Compile/src_expvar.test-10                           2.654 ± 0%   2.663 ± 0%  +0.33% (p=0.004 n=6)
Wasip1/Compile/src_flag.test-10                             1.953 ± 0%   1.954 ± 0%       ~ (p=0.310 n=6)
Wasip1/Compile/src_fmt.test-10                              1.986 ± 0%   1.988 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_hash.test-10                             1.795 ± 2%   1.796 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_hash_adler32.test-10                     1.740 ± 1%   1.739 ± 1%       ~ (p=0.818 n=6)
Wasip1/Compile/src_hash_crc32.test-10                       1.760 ± 1%   1.758 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_hash_crc64.test-10                       1.749 ± 1%   1.744 ± 1%       ~ (p=0.132 n=6)
Wasip1/Compile/src_hash_fnv.test-10                         1.758 ± 2%   1.756 ± 1%       ~ (p=0.485 n=6)
Wasip1/Compile/src_hash_maphash.test-10                     1.767 ± 1%   1.767 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_io.test-10                               1.935 ± 0%   1.931 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_io_fs.test-10                            1.932 ± 1%   1.932 ± 0%       ~ (p=0.589 n=6)
Wasip1/Compile/src_io_ioutil.test-10                        1.800 ± 0%   1.799 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_log.test-10                              1.775 ± 0%   1.772 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_log_syslog.test-10                       1.753 ± 1%   1.754 ± 1%       ~ (p=0.818 n=6)
Wasip1/Compile/src_maps.test-10                             1.764 ± 1%   1.770 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_math.test-10                             1.915 ± 0%   1.911 ± 1%       ~ (p=0.132 n=6)
Wasip1/Compile/src_math_big.test-10                         3.548 ± 0%   3.550 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_math_bits.test-10                        1.807 ± 3%   1.781 ± 0%       ~ (p=0.132 n=6)
Wasip1/Compile/src_math_cmplx.test-10                       1.816 ± 2%   1.801 ± 1%  -0.83% (p=0.026 n=6)
Wasip1/Compile/src_math_rand.test-10                        2.872 ± 0%   2.868 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_mime.test-10                             1.921 ± 1%   1.917 ± 0%       ~ (p=0.699 n=6)
Wasip1/Compile/src_mime_multipart.test-10                   2.169 ± 0%   2.162 ± 1%       ~ (p=0.180 n=6)
Wasip1/Compile/src_mime_quotedprintable.test-10             1.845 ± 0%   1.845 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_os.test-10                               2.441 ± 0%   2.436 ± 1%       ~ (p=0.310 n=6)
Wasip1/Compile/src_os_exec.test-10                          4.548 ± 0%   4.548 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_os_signal.test-10                        1.729 ± 0%   1.727 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_os_user.test-10                          1.772 ± 0%   1.769 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_path.test-10                             1.767 ± 1%   1.756 ± 1%  -0.61% (p=0.004 n=6)
Wasip1/Compile/src_reflect.test-10                          4.009 ± 0%   4.003 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_regexp.test-10                           2.094 ± 0%   2.086 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_regexp_syntax.test-10                    1.791 ± 1%   1.803 ± 1%       ~ (p=0.065 n=6)
Wasip1/Compile/src_runtime.test-10                          6.106 ± 0%   6.118 ± 2%       ~ (p=0.093 n=6)
Wasip1/Compile/src_runtime_internal_atomic.test-10          1.762 ± 0%   1.764 ± 1%       ~ (p=0.310 n=6)
Wasip1/Compile/src_runtime_internal_math.test-10            1.733 ± 0%   1.737 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_runtime_internal_sys.test-10             1.733 ± 0%   1.732 ± 1%       ~ (p=0.937 n=6)
Wasip1/Compile/src_slices.test-10                           1.972 ± 1%   1.978 ± 0%       ~ (p=0.132 n=6)
Wasip1/Compile/src_sort.test-10                             1.845 ± 0%   1.845 ± 1%       ~ (p=0.699 n=6)
Wasip1/Compile/src_strconv.test-10                          1.957 ± 0%   1.952 ± 0%       ~ (p=0.394 n=6)
Wasip1/Compile/src_strings.test-10                          2.034 ± 0%   2.031 ± 1%       ~ (p=0.180 n=6)
Wasip1/Compile/src_sync.test-10                             2.030 ± 1%   2.031 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_sync_atomic.test-10                      1.855 ± 0%   1.854 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_syscall.test-10                          1.742 ± 0%   1.741 ± 0%       ~ (p=0.589 n=6)
Wasip1/Compile/src_testing.test-10                          2.601 ± 0%   2.600 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_testing_fstest.test-10                   1.986 ± 0%   1.989 ± 0%       ~ (p=0.240 n=6)
Wasip1/Compile/src_testing_iotest.test-10                   1.798 ± 1%   1.800 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_testing_quick.test-10                    1.864 ± 0%   1.863 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_time.test-10                             2.991 ± 0%   2.993 ± 0%       ~ (p=0.818 n=6)
geomean                                                     2.095        2.096       +0.02%

mathetake · 2024-04-09T01:29:41Z

internal/engine/wazevo/engine.go

 			r.Offset += int64(totalSize)
 			e.rels = append(e.rels, r)
 		}

 		bodies[i] = body
-		totalSize += len(body)
+		totalSize += len(body) + e.machine.RelocationTrampolineSize(rels)


I think the logic is broken here. Inside RelocationTrampolineSize, it checks rels[i].TrampolineOffset > 0 but TrampolineOffset is always zero at this point. In fact

--- a/internal/engine/wazevo/engine.go
+++ b/internal/engine/wazevo/engine.go
@@ -263,7 +263,11 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene
 		}

 		bodies[i] = body
-		totalSize += len(body) + e.machine.RelocationTrampolineSize(rels)
+		s := e.machine.RelocationTrampolineSize(rels)
+		if s != 0 {
+			panic("MUST HIT")
+		}
+		totalSize += len(body) + s
 		if wazevoapi.PrintMachineCodeHexPerFunction {
 			fmt.Printf("[[[machine code for %s]]]\n%s\n\n", wazevoapi.GetCurrentFunctionName(ctx), hex.EncodeToString(body))
 		}

this obvious assertion doesn't get hit after running wazero run pg.wasm.

This indicates that the current code does not start from the "worst case" size, but from the "smallest size" scenario instead.

evacchi · 2024-04-09

Silly me, I must have introduced that bug in e.machine.RelocationTrampolineSize(rels) while writing and cleaning up the tests. Earlier it just multiplied by the size of rels.
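
For readers following along, the difference between the two behaviors can be modeled like this. It is a simplified sketch: relocationInfo, the 24-byte trampoline size and both function names are stand-ins, not the actual RelocationTrampolineSize implementation:

```go
package main

import "fmt"

const trampolineSize = 24 // assumption: 6 instructions * 4 bytes

// relocationInfo is a simplified stand-in for the PR's relocation record.
type relocationInfo struct {
	TrampolineOffset int // 0 until a trampoline has been assigned
}

// sizeByFlag mirrors the check described in the review: it counts only
// the relocations that already have a trampoline offset.
func sizeByFlag(rels []relocationInfo) int {
	n := 0
	for _, r := range rels {
		if r.TrampolineOffset > 0 {
			n += trampolineSize
		}
	}
	return n
}

// worstCaseSize reserves one trampoline slot per relocation, which is
// what the first pass is supposed to assume.
func worstCaseSize(rels []relocationInfo) int {
	return len(rels) * trampolineSize
}

func main() {
	rels := make([]relocationInfo, 3) // no trampoline offsets assigned yet
	fmt.Println("by flag:", sizeByFlag(rels), "worst case:", worstCaseSize(rels))
}
```

During the first pass no TrampolineOffset has been assigned yet, so the flag-based variant reserves nothing, which matches the "smallest size" behavior reported above.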

mathetake · 2024-04-09T05:18:22Z

ok I think I will take this over. Thank you for your hard work and hints here @evacchi

mathetake · 2024-04-09T08:20:23Z

superseded by #2181

evacchi commented Apr 2, 2024

wazevo(arm64): trampoline for far calls on relocation

2be42ae

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi force-pushed the far-call branch from 04c5c94 to 2be42ae Compare April 2, 2024 20:33

mathetake reviewed Apr 2, 2024

internal/engine/wazevo/backend/isa/arm64/machine_relocation.go

mathetake reviewed Apr 2, 2024

internal/engine/wazevo/engine.go

evacchi added 2 commits April 3, 2024 11:10

just return RelocationInfo instead of passing a pointer

5108349

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

use BR instead of BLR

ed83eb3

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

Precompute maximal trampoline size, then allocate "on demand".

ad49ce0

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi force-pushed the far-call branch from 3c69713 to ad49ce0 Compare April 4, 2024 12:10

evacchi commented Apr 4, 2024

internal/engine/wazevo/backend/machine.go

evacchi commented Apr 4, 2024

evacchi added 2 commits April 4, 2024 16:44

Formatting, refactoring, cleanup

0b20680

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

Unit testing

9418ad5

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi mentioned this pull request Apr 4, 2024

compiler(arm64) overflow in call relocations #2158

Closed

mathetake reviewed Apr 4, 2024

internal/engine/wazevo/backend/isa/arm64/machine_relocation.go

internal/engine/wazevo/backend/machine.go

evacchi added 3 commits April 5, 2024 14:17

Apply suggestions

a3931da

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

Add more test cases

5db5e18

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

Missing doc comment

b149416

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

evacchi marked this pull request as ready for review April 5, 2024 19:02

evacchi requested a review from mathetake April 5, 2024 19:02

mathetake reviewed Apr 9, 2024

mathetake mentioned this pull request Apr 9, 2024

GOOS=wasip1 embedding wazero succeeds with wasmtime but fails with wazero on arm64 #2179

Closed

mathetake mentioned this pull request Apr 9, 2024

compiler(arm64): fixes overflow in huge executable relocations #2181

Merged

mathetake closed this Apr 9, 2024

mathetake deleted the far-call branch April 9, 2024 08:20
