wazevo(arm64): trampoline for far calls on relocation #2169

Closed

Contributor

@evacchi evacchi commented Apr 2, 2024

Fixes #2158, follows up to #2167.

Draft for feedback on the approach; also needs tests.

  • Essentially, this adds some padding (r.TrampolineOffset) at the end of a function to accommodate a trampoline for far jumps (using BLR).
  • The trampoline is written only if the relative jump is outside the accepted range.
  • When such an out-of-range relative jump is detected, instead of jumping to the function we jump to the trampoline, which contains:
    • a 64-bit mov to the tmp reg (movz/movk/movk/movk)
    • BLR to tmp
    • an unconditional branch back to the origin
  • Note: this currently adds the padding for all relocated calls, so it is not optimal. I will continue working on it and then undraft; I think it should be doable with some minor refactoring. The only cost is wasted space, though: the trampoline is used only for far jumps.
  • Also note: this uses a BLR. In other places (e.g. resolveRelativeAddress) we use relative jumps, but at that point 1) we don't know the absolute address yet and 2) we can inject instructions at any point within the function because we are resolving the addresses before encoding.

Moreover, BLR's 64-bit addressing should be comfortable enough to reach any point in the executable.
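
For illustration only, the trampoline described above could be encoded roughly as follows. This is a minimal standalone sketch, not the code in this PR: the helper names (moveWide, branch, trampoline) are hypothetical, x27 is used as the temporary register purely as an example, and returnOffset is assumed to be the byte distance from the final branch instruction to the return point. Only the base opcodes come from the A64 instruction set.

```go
package main

import "fmt"

// Base opcodes for the 64-bit (X register) forms in the A64 instruction set.
const (
	opMOVZ uint32 = 0xD2800000 // MOVZ Xd, #imm16, LSL #(16*hw)
	opMOVK uint32 = 0xF2800000 // MOVK Xd, #imm16, LSL #(16*hw)
	opBLR  uint32 = 0xD63F0000 // BLR Xn
	opB    uint32 = 0x14000000 // B <signed 26-bit word offset>
)

// moveWide encodes MOVZ or MOVK of one 16-bit chunk into rd at slot hw (0..3).
func moveWide(op, rd uint32, imm16 uint16, hw uint32) uint32 {
	return op | hw<<21 | uint32(imm16)<<5 | rd
}

// branch encodes B with a byte offset relative to the branch instruction
// itself. The offset is assumed to be a multiple of 4 and within +/-128 MiB.
func branch(byteOffset int64) uint32 {
	return opB | uint32(byteOffset/4)&0x03FFFFFF
}

// trampoline (hypothetical helper) builds the six-instruction sequence:
// load the absolute address into tmp, call it, then branch back.
func trampoline(addr uint64, tmp uint32, returnOffset int64) [6]uint32 {
	return [6]uint32{
		moveWide(opMOVZ, tmp, uint16(addr), 0),
		moveWide(opMOVK, tmp, uint16(addr>>16), 1),
		moveWide(opMOVK, tmp, uint16(addr>>32), 2),
		moveWide(opMOVK, tmp, uint16(addr>>48), 3),
		opBLR | tmp<<5,
		branch(returnOffset),
	}
}

func main() {
	// Example values only: an arbitrary 64-bit address and x27 as tmp.
	for _, w := range trampoline(0x1122334455667788, 27, -0x100) {
		fmt.Printf("%08x\n", w)
	}
}
```

Six 4-byte instructions make the sequence 24 bytes long, which is the per-call padding being discussed.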

This was manually tested against the reproducers (top post and others in the comments) in #2158 and verified to work, but I will add proper tests.

Signed-off-by: Edoardo Vacchi evacchi@users.noreply.github.com

Comment on lines 61 to 67
encodeMoveWideImmediate(movzOp, tmpReg, uint64(uint16(addr)), 0, 1),
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>16)), 1, 1),
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>32)), 2, 1),
encodeMoveWideImmediate(movkOp, tmpReg, uint64(uint16(addr>>48)), 3, 1),
encodeUnconditionalBranchReg(tmpReg, true),
encodeUnconditionalBranch(false, returnOffset-6*4),
Contributor Author

It's possible that we can resolve the width of the jump earlier and either generate a smaller trampoline, with fewer instructions, or use a different strategy for encoding the address (for now I just wanted to see if the general approach was working).

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Member

@mathetake mathetake left a comment

the direction looks good, but did you do some research on how other linkers/compilers deal with this?

internal/engine/wazevo/engine.go (outdated, resolved)
Contributor Author

evacchi commented Apr 3, 2024

did you do some research on how other linkers/compilers deal with this?

Yes, I will add some notes inline in the code. Here's my summary.

On ARM there are 3 types of "veneer" (trampoline/thunk in ARM parlance)

The main issue is where to put them. My first thought was to tack it at the end of the current function for simplicity, but how do linkers do it?

Because the job of a traditional linker is to shuffle code around, it's cheap for them to generate new code and then point the branch to it; essentially, they can generate a new (internal) symbol. This online book ("Embedded Systems Security and TrustZone") gives a nice explanation of how this works. But in our case adding a new symbol/code section would mean adding a new functionOffset and mmapping a new segment.

Placing these new segments now represents a problem of its own; for instance:

  • we could add these new segments as we need them, incrementing the i in cm.functionOffsets[i]; e.g. say we are processing function i and we discover we need k trampolines, then we would generate those k sections and add them at i+1, ..., i+k,
    but then we would lose the correspondence between cm.functionOffsets[i] and module.CodeSection[i] (as well as other similar cases), and that would cascade to the surrounding code.

  • Instead, we could add them to the end of functionOffsets, with j=len(functionOffsets), .... However, this means that, depending on the order in which we generate and mmap them, they could end up even farther away than the original function that needed to be invoked.

  • Besides, this also adds more mmapped code to be collected and finalized.

In fact, these sections can be fused with the code of a function: this behavior can be controlled through linker scripts; for instance, the veneer for interlinking is joined with the .text section by the default linker script.

So, tacking it to the bottom of the original function is actually a behavior that can be controlled even in traditional linkers, and, in our case, it addresses the other problems; assuming that the bottom of the function is within range for a relative jump.

I currently ruled out encoding the trampoline as a short-range thunk for simplicity: it's an optimization we can address later, and it was important to get to the bottom of this first.

Another optimization might be detecting multiple invocations of the same function and jumping to the same veneer.
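
As a rough sketch of that deduplication idea (not something this PR implements), a per-function table could hand out one veneer slot per distinct target. The type and method names below are hypothetical, and the 24-byte slot size assumes the six-instruction sequence used in this PR.

```go
package main

import "fmt"

// veneerSize assumes six 4-byte instructions per veneer.
const veneerSize = 6 * 4

// veneerTable (hypothetical) maps a call target to the offset of its veneer
// within the padding area at the end of a function.
type veneerTable struct {
	slots map[uint64]int
}

func newVeneerTable() *veneerTable {
	return &veneerTable{slots: map[uint64]int{}}
}

// slotFor returns the veneer offset for target, allocating a new slot only
// the first time a given target is seen.
func (t *veneerTable) slotFor(target uint64) (offset int, created bool) {
	if off, ok := t.slots[target]; ok {
		return off, false
	}
	off := len(t.slots) * veneerSize
	t.slots[target] = off
	return off, true
}

// size is the total padding needed for all distinct targets seen so far.
func (t *veneerTable) size() int { return len(t.slots) * veneerSize }

func main() {
	t := newVeneerTable()
	for _, target := range []uint64{0x1000, 0x2000, 0x1000} {
		off, created := t.slotFor(target)
		fmt.Printf("target=%#x offset=%d created=%v\n", target, off, created)
	}
	fmt.Println("padding bytes:", t.size())
}
```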

Finally, the text itself of the veneer: how do you load the address? There are a few possible approaches; in my case, again, for simplicity, I am movz/movk/movk/movk'ing the address to the tmp reg which we know is safe to use. Another way to load the address is similar to brTableSequence, using adr and a constant pool. In some cases (when the address is especially large), adrp+add might need to be used instead (adrp addresses by 4KB-pages). E.g. the example in the book uses ldr r0, [pc, #8] which is essentially what adr r0, #8 would translate to.

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Contributor Author

evacchi commented Apr 3, 2024

I improved the PoC but it's not clean enough for reviewing, so I'll push the code later, but I'll write some notes here.

I wanted to make sure that we do not waste too much space on trampolines that are not needed. In the code here we always allocate the space for a trampoline for each relocation, regardless of whether the trampoline will be needed (i.e. even when the relocation does not require a far jump).

Now, suppose that we have functions fk, fk1, fk2 all adjacent to each other. Suppose that fk needs more space to accommodate a trampoline to fk2. If we do a linear scan, at this point, we can't know for certain the offset of fk2 because we don't know if also fk1 will require some trampolines (causing it to increase in size).

The solution is obviously to iterate more than once; but how many times? We could compute a fixed point, but instead:

  • we can do a first pass assuming the "worst case" scenario, where each relocation really needs a trampoline;
    • this way, we can precompute the size of each function with all the trampolines (and thus their offsets) knowing that they can't grow further than that
  • we iterate over the relocations and flag those that really need a trampoline:
    • now, for each function, if any of the offset diffs is within range, we know for certain that that trampoline won't be needed
    • conversely, if any of the offset diffs is outside the acceptable range, we will just assume that a trampoline will be needed to avoid further iterations (we can further optimize the edge cases in the future)
  • we do a final pass where we actually allocate space at the end of each function and write the trampolines
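
The flagging step above can be sketched as follows. This is a simplified illustration, assuming the offsets come from the worst-case layout and that the call is a BL, whose signed 26-bit word offset covers [-128 MiB, +128 MiB - 4]. The relocation struct and function names are hypothetical, not the ones used in wazero.

```go
package main

import "fmt"

// Reach of an A64 BL/B instruction: a signed 26-bit offset counted in words.
const (
	minBranchOffset = -(1 << 27)    // -128 MiB
	maxBranchOffset = (1 << 27) - 4 // +128 MiB minus one instruction
)

// relocation (hypothetical) holds absolute offsets in the worst-case layout.
type relocation struct {
	callSite int64 // offset of the BL instruction
	target   int64 // offset of the callee
}

// inBranchRange reports whether a relative branch can encode diff directly.
func inBranchRange(diff int64) bool {
	return diff >= minBranchOffset && diff <= maxBranchOffset
}

// flagTrampolines conservatively marks every relocation whose worst-case
// distance is out of range. Distances can only shrink once unused padding is
// dropped, so anything in range here stays in range.
func flagTrampolines(rels []relocation) []bool {
	needed := make([]bool, len(rels))
	for i, r := range rels {
		needed[i] = !inBranchRange(r.target - r.callSite)
	}
	return needed
}

func main() {
	rels := []relocation{
		{callSite: 0x100, target: 0x2000},   // near call
		{callSite: 0x9000000, target: 0x10}, // more than 128 MiB backwards
	}
	fmt.Println(flagTrampolines(rels)) // prints [false true]
}
```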

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Contributor Author

evacchi commented Apr 4, 2024

Pushed change following #2169 (comment); actually, slightly improved because I realized I do not need to do 3 separate passes, but only 2 are really necessary.

  • First precompute all the offsets including "maximal" sizes for trampolines
  • Scan forward, recompute the diffs, and account for the trampolines that are needed.

We don't even need to actually allocate the space (body = append(body, make([]byte, N)...)), because when we mmap and copy over the bodies, they are laid out at the given offset, which already accounts for the trampolines. When we write the trampolines later, we overwrite all the uninitialized bytes, so everything works.
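
To make the layout point concrete, here is a minimal sketch in which a plain byte slice stands in for the mmapped executable. The function struct, layout and copyBodies are hypothetical names, and trampolineSize is assumed to have been decided by the earlier pass.

```go
package main

import "fmt"

// function (hypothetical) pairs a compiled body with the number of bytes
// reserved after it for trampolines (0 when no far calls were flagged).
type function struct {
	body           []byte
	trampolineSize int
}

// layout assigns each function its offset; the reserved trampoline bytes are
// accounted for in the offsets, not appended to the bodies.
func layout(fns []function) (offsets []int, total int) {
	offsets = make([]int, len(fns))
	for i, f := range fns {
		offsets[i] = total
		total += len(f.body) + f.trampolineSize
	}
	return offsets, total
}

// copyBodies places each body at its offset, leaving a gap after it that is
// filled in later when the trampolines are written.
func copyBodies(fns []function, offsets []int, total int) []byte {
	exe := make([]byte, total) // stand-in for the mmapped segment
	for i, f := range fns {
		copy(exe[offsets[i]:], f.body)
	}
	return exe
}

func main() {
	fns := []function{
		{body: []byte{1, 1, 1, 1, 1, 1, 1, 1}, trampolineSize: 24},
		{body: []byte{2, 2, 2, 2}},
	}
	offsets, total := layout(fns)
	exe := copyBodies(fns, offsets, total)
	fmt.Println(offsets, total, len(exe)) // prints [0 32] 36 36
}
```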

This still needs proper tests.

@@ -220,6 +220,7 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene
totalSize := 0 // Total binary size of the executable.
cm.functionOffsets = make([]int, localFns)
bodies := make([][]byte, localFns)
sourceOffsets := make([][]backend.SourceOffsetInfo, localFns)
Contributor Author

we need to append the sourceOffsets in the second stage, so we need to store be.SourceOffsetInfo() somewhere. I don't love this change, but I could not think of anything better at the moment...

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Contributor Author

evacchi commented Apr 4, 2024

Added a few test cases, will continue later.

Member

@mathetake mathetake left a comment

on the right track I guess!

internal/engine/wazevo/backend/machine.go (outdated, resolved)
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
@evacchi evacchi marked this pull request as ready for review April 5, 2024 19:02
@evacchi evacchi requested a review from mathetake April 5, 2024 19:02
@mathetake
Member

would you mind comparing the binary size and the compilation time perf before/after for stdlib binaries (Zig, Go, and libsodium would be fine)?

Contributor Author

evacchi commented Apr 8, 2024

I double-checked file sizes and, for all these benchmarks, they do not change (as expected), because all of these test cases are small.
During my tests, in fact, I always double-checked with an (arguably non-representative) test case (examples/cli)
that the file size didn't grow, and I also tested with the original reported case (and other variants).

In short, the difference is small in file size and negligible in time (notice that wazero compile exits early in the first case, because of the panic).

Original reported case

❯ time wazero compile pg.wasm
264214980
panic: TODO: too large binary where branch target is out of the supported range +/-128MB: -0x8000388
[...]
wazero compile pg.wasm  29.16s user 0.17s system 100% cpu 29.323 total
❯ time wazero compile pg.wasm
266556404
wazero compile pg.wasm  29.37s user 0.17s system 99% cpu 29.794 total

compiled size:

266,556,404 - 264,214,980 = 2,341,424 (+0.886%)

Other Benchmarks

I have omitted file sizes here because they do not change.

Zig. No file size impact, essentially no run-time impact.
goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/stdlibs
                             │ zig-before.txt │          zig-after.txt           │
                             │     sec/op     │   sec/op    vs base              │
Zig/Compile/test-opt.wasm-10       4.704 ± 1%   4.687 ± 1%       ~ (p=0.132 n=6)
Zig/Compile/test.wasm-10           5.835 ± 0%   5.857 ± 1%       ~ (p=0.065 n=6)
geomean                            5.239        5.239       +0.00%
libsodium. No file size impact, about 0.23% run-time impact.
goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/libsodium
                                            │ libsodium-before.txt │          libsodium-after.txt           │
                                            │        sec/op        │      sec/op       vs base              │
Libsodium/box_easy2-10                            0.0007129n ±  5%   0.0006772n ±  9%       ~ (p=0.132 n=6)
Libsodium/kdf_hkdf-10                             0.0003953n ±  4%   0.0003964n ±  3%       ~ (p=0.937 n=6)
Libsodium/auth5-10                                0.0003127n ±  7%   0.0003168n ± 11%       ~ (p=1.000 n=6)
Libsodium/stream2-10                              0.0003393n ±  5%   0.0003427n ±  8%       ~ (p=0.394 n=6)
Libsodium/aead_xchacha20poly1305-10               0.0003348n ±  4%   0.0003343n ±  2%       ~ (p=0.788 n=6)
Libsodium/hash3-10                                0.0002804n ±  5%   0.0002856n ±  4%       ~ (p=0.485 n=6)
Libsodium/aead_chacha20poly1305-10                0.0003540n ± 16%   0.0003385n ±  4%       ~ (p=0.093 n=6)
Libsodium/auth-10                                 0.0003556n ±  6%   0.0003603n ±  2%       ~ (p=0.180 n=6)
Libsodium/onetimeauth-10                          0.0002945n ±  7%   0.0002922n ±  6%       ~ (p=0.937 n=6)
Libsodium/aead_aegis256-10                        0.0004372n ±  4%   0.0004215n ±  6%       ~ (p=0.093 n=6)
Libsodium/scalarmult_ristretto255-10              0.0007270n ±  4%   0.0007688n ±  4%  +5.74% (p=0.009 n=6)
Libsodium/stream3-10                              0.0002701n ±  7%   0.0002811n ± 15%       ~ (p=0.240 n=6)
Libsodium/shorthash-10                            0.0002634n ± 12%   0.0002622n ±  3%       ~ (p=0.310 n=6)
Libsodium/scalarmult-10                           0.0005916n ±  4%   0.0005988n ±  4%       ~ (p=0.221 n=6)
Libsodium/chacha20-10                             0.0003030n ±  7%   0.0003136n ±  3%       ~ (p=0.240 n=6)
Libsodium/onetimeauth7-10                         0.0002718n ±  3%   0.0002714n ±  6%       ~ (p=0.853 n=6)
Libsodium/scalarmult7-10                          0.0005432n ±  2%   0.0005475n ±  1%       ~ (p=0.240 n=6)
Libsodium/auth3-10                                0.0003165n ±  5%   0.0002981n ±  3%  -5.84% (p=0.009 n=6)
Libsodium/stream4-10                              0.0002684n ±  6%   0.0002789n ±  3%       ~ (p=0.093 n=6)
Libsodium/hash-10                                 0.0003428n ±  5%   0.0003498n ±  3%       ~ (p=0.093 n=6)
Libsodium/auth2-10                                0.0003006n ±  7%   0.0003041n ±  3%       ~ (p=0.818 n=6)
Libsodium/scalarmult6-10                          0.0005946n ±  5%   0.0005847n ±  6%       ~ (p=0.180 n=6)
Libsodium/ed25519_convert-10                      0.0008720n ±  1%   0.0009367n ±  4%  +7.43% (p=0.002 n=6)
Libsodium/box_seal-10                             0.0007852n ±  3%   0.0007823n ±  5%       ~ (p=0.818 n=6)
Libsodium/secretbox7-10                           0.0003158n ±  2%   0.0003074n ±  4%       ~ (p=0.180 n=6)
Libsodium/pwhash_argon2i-10                       0.0005476n ±  6%   0.0005491n ±  4%       ~ (p=1.000 n=6)
Libsodium/secretstream_xchacha20poly1305-10       0.0003824n ±  4%   0.0003855n ±  1%       ~ (p=0.485 n=6)
Libsodium/codecs-10                               0.0003263n ±  3%   0.0003268n ±  6%       ~ (p=1.000 n=6)
Libsodium/scalarmult_ed25519-10                   0.0008271n ±  1%   0.0008541n ±  2%  +3.26% (p=0.004 n=6)
Libsodium/sodium_utils-10                         0.0003061n ±  2%   0.0003191n ±  5%       ~ (p=0.132 n=6)
Libsodium/scalarmult5-10                          0.0005680n ±  3%   0.0005488n ±  2%  -3.38% (p=0.002 n=6)
Libsodium/xchacha20-10                            0.0007805n ±  3%   0.0007700n ±  4%       ~ (p=0.818 n=6)
Libsodium/secretbox8-10                           0.0003150n ±  6%   0.0003105n ±  8%       ~ (p=0.589 n=6)
Libsodium/box2-10                                 0.0006106n ±  4%   0.0006109n ±  7%       ~ (p=0.818 n=6)
Libsodium/core3-10                                0.0003376n ±  2%   0.0003367n ±  2%       ~ (p=0.818 n=6)
Libsodium/siphashx24-10                           0.0002583n ±  6%   0.0002516n ±  5%       ~ (p=0.485 n=6)
Libsodium/generichash-10                          0.0004203n ±  2%   0.0004451n ±  5%  +5.90% (p=0.026 n=6)
Libsodium/aead_chacha20poly13052-10               0.0003462n ±  5%   0.0003464n ±  6%       ~ (p=0.818 n=6)
Libsodium/randombytes-10                          0.0002721n ±  3%   0.0002599n ±  2%  -4.50% (p=0.004 n=6)
Libsodium/scalarmult8-10                          0.0005683n ±  2%   0.0005628n ±  6%       ~ (p=0.818 n=6)
Libsodium/kx-10                                   0.0007032n ± 12%   0.0006739n ±  1%  -4.17% (p=0.009 n=6)
Libsodium/stream-10                               0.0003523n ±  3%   0.0003618n ±  3%       ~ (p=0.132 n=6)
Libsodium/auth7-10                                0.0003017n ±  8%   0.0003043n ±  4%       ~ (p=0.699 n=6)
Libsodium/generichash2-10                         0.0003576n ±  5%   0.0003523n ±  1%       ~ (p=0.132 n=6)
Libsodium/box_seed-10                             0.0006024n ±  4%   0.0005815n ±  1%  -3.47% (p=0.002 n=6)
Libsodium/keygen-10                               0.0002788n ±  3%   0.0002735n ±  5%       ~ (p=1.000 n=6)
Libsodium/metamorphic-10                          0.0005393n ±  4%   0.0005164n ±  3%  -4.25% (p=0.004 n=6)
Libsodium/secretbox_easy2-10                      0.0003556n ±  2%   0.0003679n ±  3%  +3.47% (p=0.026 n=6)
Libsodium/sign2-10                                0.0008472n ±  1%   0.0008428n ±  1%       ~ (p=0.394 n=6)
Libsodium/box_easy-10                             0.0006116n ±  5%   0.0006247n ±  5%       ~ (p=0.394 n=6)
Libsodium/secretbox2-10                           0.0003079n ±  2%   0.0003050n ±  6%       ~ (p=0.667 n=6)
Libsodium/box-10                                  0.0006185n ±  6%   0.0006212n ±  5%       ~ (p=0.699 n=6)
Libsodium/kdf-10                                  0.0003596n ±  4%   0.0003609n ± 16%       ~ (p=1.000 n=6)
Libsodium/secretbox_easy-10                       0.0003408n ±  9%   0.0003423n ±  2%       ~ (p=1.000 n=6)
Libsodium/onetimeauth2-10                         0.0002460n ±  5%   0.0002517n ±  2%       ~ (p=0.240 n=6)
Libsodium/generichash3-10                         0.0003480n ±  3%   0.0003482n ±  3%       ~ (p=0.937 n=6)
Libsodium/scalarmult2-10                          0.0005454n ±  2%   0.0005735n ±  4%  +5.17% (p=0.002 n=6)
Libsodium/aead_aegis128l-10                       0.0004260n ±  5%   0.0004270n ± 18%       ~ (p=0.589 n=6)
Libsodium/auth6-10                                0.0002917n ±  3%   0.0002952n ± 12%       ~ (p=0.240 n=6)
Libsodium/secretbox-10                            0.0003171n ±  4%   0.0003254n ±  5%       ~ (p=0.180 n=6)
Libsodium/verify1-10                              0.0002948n ±  3%   0.0002955n ±  4%       ~ (p=0.818 n=6)
geomean                                           0.0004051n         0.0004060n        +0.23%
Wasip1. No file size impact, essentially no run-time impact (0.02%).
goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/stdlibs
                                                   │ wasip1-before.txt │         wasip1-after.txt         │
                                                   │      sec/op       │   sec/op    vs base              │
Wasip1/Compile/src_archive_tar.test-10                      3.195 ± 1%   3.214 ± 1%       ~ (p=0.240 n=6)
Wasip1/Compile/src_bufio.test-10                            1.936 ± 0%   1.941 ± 1%  +0.26% (p=0.015 n=6)
Wasip1/Compile/src_bytes.test-10                            1.999 ± 0%   2.008 ± 1%  +0.49% (p=0.009 n=6)
Wasip1/Compile/src_context.test-10                          2.142 ± 0%   2.150 ± 1%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_ascii85.test-10                 1.748 ± 1%   1.754 ± 0%       ~ (p=0.065 n=6)
Wasip1/Compile/src_encoding_asn1.test-10                    1.987 ± 0%   1.994 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_base32.test-10                  1.826 ± 1%   1.830 ± 1%       ~ (p=0.240 n=6)
Wasip1/Compile/src_encoding_base64.test-10                  1.840 ± 0%   1.854 ± 1%  +0.74% (p=0.004 n=6)
Wasip1/Compile/src_encoding_binary.test-10                  1.877 ± 1%   1.883 ± 1%  +0.33% (p=0.041 n=6)
Wasip1/Compile/src_encoding_csv.test-10                     1.838 ± 0%   1.843 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_encoding_gob.test-10                     2.428 ± 0%   2.432 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_encoding_hex.test-10                     1.777 ± 0%   1.782 ± 1%       ~ (p=0.485 n=6)
Wasip1/Compile/src_encoding_json.test-10                    5.226 ± 1%   5.228 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_encoding_pem.test-10                     2.298 ± 1%   2.302 ± 0%       ~ (p=0.240 n=6)
Wasip1/Compile/src_encoding_xml.test-10                     2.130 ± 0%   2.131 ± 0%       ~ (p=0.699 n=6)
Wasip1/Compile/src_errors.test-10                           1.821 ± 1%   1.825 ± 0%       ~ (p=0.485 n=6)
Wasip1/Compile/src_expvar.test-10                           2.654 ± 0%   2.663 ± 0%  +0.33% (p=0.004 n=6)
Wasip1/Compile/src_flag.test-10                             1.953 ± 0%   1.954 ± 0%       ~ (p=0.310 n=6)
Wasip1/Compile/src_fmt.test-10                              1.986 ± 0%   1.988 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_hash.test-10                             1.795 ± 2%   1.796 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_hash_adler32.test-10                     1.740 ± 1%   1.739 ± 1%       ~ (p=0.818 n=6)
Wasip1/Compile/src_hash_crc32.test-10                       1.760 ± 1%   1.758 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_hash_crc64.test-10                       1.749 ± 1%   1.744 ± 1%       ~ (p=0.132 n=6)
Wasip1/Compile/src_hash_fnv.test-10                         1.758 ± 2%   1.756 ± 1%       ~ (p=0.485 n=6)
Wasip1/Compile/src_hash_maphash.test-10                     1.767 ± 1%   1.767 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_io.test-10                               1.935 ± 0%   1.931 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_io_fs.test-10                            1.932 ± 1%   1.932 ± 0%       ~ (p=0.589 n=6)
Wasip1/Compile/src_io_ioutil.test-10                        1.800 ± 0%   1.799 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_log.test-10                              1.775 ± 0%   1.772 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_log_syslog.test-10                       1.753 ± 1%   1.754 ± 1%       ~ (p=0.818 n=6)
Wasip1/Compile/src_maps.test-10                             1.764 ± 1%   1.770 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_math.test-10                             1.915 ± 0%   1.911 ± 1%       ~ (p=0.132 n=6)
Wasip1/Compile/src_math_big.test-10                         3.548 ± 0%   3.550 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_math_bits.test-10                        1.807 ± 3%   1.781 ± 0%       ~ (p=0.132 n=6)
Wasip1/Compile/src_math_cmplx.test-10                       1.816 ± 2%   1.801 ± 1%  -0.83% (p=0.026 n=6)
Wasip1/Compile/src_math_rand.test-10                        2.872 ± 0%   2.868 ± 0%       ~ (p=0.093 n=6)
Wasip1/Compile/src_mime.test-10                             1.921 ± 1%   1.917 ± 0%       ~ (p=0.699 n=6)
Wasip1/Compile/src_mime_multipart.test-10                   2.169 ± 0%   2.162 ± 1%       ~ (p=0.180 n=6)
Wasip1/Compile/src_mime_quotedprintable.test-10             1.845 ± 0%   1.845 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_os.test-10                               2.441 ± 0%   2.436 ± 1%       ~ (p=0.310 n=6)
Wasip1/Compile/src_os_exec.test-10                          4.548 ± 0%   4.548 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_os_signal.test-10                        1.729 ± 0%   1.727 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_os_user.test-10                          1.772 ± 0%   1.769 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_path.test-10                             1.767 ± 1%   1.756 ± 1%  -0.61% (p=0.004 n=6)
Wasip1/Compile/src_reflect.test-10                          4.009 ± 0%   4.003 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_regexp.test-10                           2.094 ± 0%   2.086 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_regexp_syntax.test-10                    1.791 ± 1%   1.803 ± 1%       ~ (p=0.065 n=6)
Wasip1/Compile/src_runtime.test-10                          6.106 ± 0%   6.118 ± 2%       ~ (p=0.093 n=6)
Wasip1/Compile/src_runtime_internal_atomic.test-10          1.762 ± 0%   1.764 ± 1%       ~ (p=0.310 n=6)
Wasip1/Compile/src_runtime_internal_math.test-10            1.733 ± 0%   1.737 ± 1%       ~ (p=0.394 n=6)
Wasip1/Compile/src_runtime_internal_sys.test-10             1.733 ± 0%   1.732 ± 1%       ~ (p=0.937 n=6)
Wasip1/Compile/src_slices.test-10                           1.972 ± 1%   1.978 ± 0%       ~ (p=0.132 n=6)
Wasip1/Compile/src_sort.test-10                             1.845 ± 0%   1.845 ± 1%       ~ (p=0.699 n=6)
Wasip1/Compile/src_strconv.test-10                          1.957 ± 0%   1.952 ± 0%       ~ (p=0.394 n=6)
Wasip1/Compile/src_strings.test-10                          2.034 ± 0%   2.031 ± 1%       ~ (p=0.180 n=6)
Wasip1/Compile/src_sync.test-10                             2.030 ± 1%   2.031 ± 0%       ~ (p=0.937 n=6)
Wasip1/Compile/src_sync_atomic.test-10                      1.855 ± 0%   1.854 ± 0%       ~ (p=1.000 n=6)
Wasip1/Compile/src_syscall.test-10                          1.742 ± 0%   1.741 ± 0%       ~ (p=0.589 n=6)
Wasip1/Compile/src_testing.test-10                          2.601 ± 0%   2.600 ± 0%       ~ (p=0.818 n=6)
Wasip1/Compile/src_testing_fstest.test-10                   1.986 ± 0%   1.989 ± 0%       ~ (p=0.240 n=6)
Wasip1/Compile/src_testing_iotest.test-10                   1.798 ± 1%   1.800 ± 0%       ~ (p=0.180 n=6)
Wasip1/Compile/src_testing_quick.test-10                    1.864 ± 0%   1.863 ± 1%       ~ (p=0.589 n=6)
Wasip1/Compile/src_time.test-10                             2.991 ± 0%   2.993 ± 0%       ~ (p=0.818 n=6)
geomean                                                     2.095        2.096       +0.02%

r.Offset += int64(totalSize)
e.rels = append(e.rels, r)
}

bodies[i] = body
- totalSize += len(body)
+ totalSize += len(body) + e.machine.RelocationTrampolineSize(rels)
Member

@mathetake mathetake Apr 9, 2024

I think the logic is broken here. Inside RelocationTrampolineSize, it checks rels[i].TrampolineOffset > 0 but TrampolineOffset is always zero at this point. In fact

--- a/internal/engine/wazevo/engine.go
+++ b/internal/engine/wazevo/engine.go
@@ -263,7 +263,11 @@ func (e *engine) compileModule(ctx context.Context, module *wasm.Module, listene
                }
 
                bodies[i] = body
-               totalSize += len(body) + e.machine.RelocationTrampolineSize(rels)
+               s := e.machine.RelocationTrampolineSize(rels)
+               if s != 0 {
+                       panic("MUST HIT")
+               }
+               totalSize += len(body) + s
                if wazevoapi.PrintMachineCodeHexPerFunction {
                        fmt.Printf("[[[machine code for %s]]]\n%s\n\n", wazevoapi.GetCurrentFunctionName(ctx), hex.EncodeToString(body))
                }

this obvious assertion doesn't get hit after running wazero run pg.wasm.

Member

This indicates that the current code is not starting with the worst-case size, but with the "smallest size" scenario instead.

Contributor Author

@evacchi evacchi Apr 9, 2024

Silly me, I must have introduced that bug in e.machine.RelocationTrampolineSize(rels) while writing and cleaning up the tests. Earlier it just multiplied by the size of rels.

@mathetake
Member

ok I think I will take this over. Thank you for your hard work and hints here @evacchi

@mathetake
Member

superseded by #2181

@mathetake mathetake closed this Apr 9, 2024
@mathetake mathetake deleted the far-call branch April 9, 2024 08:20
Successfully merging this pull request may close these issues.

compiler(arm64) overflow in call relocations
2 participants