Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compiler: lifecycle issue on finalizers with tables #2102

Closed
wants to merge 10 commits into from
Closed

Conversation

evacchi
Copy link
Contributor

@evacchi evacchi commented Feb 28, 2024

So this was definitely a journey :D This PR should fix #2039 for the old compiler, but really it addresses an issue that is unrelated to the file cache, or the compiler (well, not directly) for that matter. It was indeed related to the finalizer, but it is much more articulated than you'd think.

Indeed, the problem is a lifecycle issue that involves finalizers, and it definitely relates to invoking code that is no longer mmapped; however, it was not due to a module being unmapped while executing, but it is a module doing a call_indirect on a module that was collected.

The spect tests generally do not fail because the GC is not performed at each iteration, but if we force it to collect at the end of each unit test (ab4f3d8) the spec tests will fail very predictably (see https://github.com/tetratelabs/wazero/actions/runs/8086797452/job/22097337665 and https://github.com/tetratelabs/wazero/actions/runs/8086797455/job/22097342252)

In fact, by allowing a bit of output, we learn that one of the tests in linking.wast enters (->) a code segment with a given address (in this case 0x12aba0000) in assert_uninstantiable, then leaves it (<-), but then this segment is immediately released (!!), i.e. finalized. This is also where we learn that assert_return/line:388, later, fails spectacularly:

        === RUN   TestFileCache_compiler/spectest/linking.wast/assert_uninstantiable/line:371
        2024/02/28 16:54:11 -> 12aba0000
        2024/02/28 16:54:11 <- 12aba0000
        2024/02/28 16:54:11 !! 12aba0000
        === RUN   TestFileCache_compiler/spectest/linking.wast/assert_return/line:387
        2024/02/28 16:54:11 -> 12ab90000
        2024/02/28 16:54:11 <- 12ab90000
        === RUN   TestFileCache_compiler/spectest/linking.wast/assert_return/line:388
        2024/02/28 16:54:11 -> 12ab90000
        unexpected fault address 0x12aba0000
        fatal error: fault
        [signal SIGSEGV: segmentation violation code=0x2 addr=0x12aba0000 pc=0x12aba0000]

now how is it possible that 0x12ab90000 causes a fault at 0x12aba0000 ? that's because linking.wast:388 is referencing a function from another module (these are linking-related tests after all)

(assert_return (invoke $Ms "get table[0]") (i32.const 0xdead))

Now, last question. Why is segment 0x12aba0000 being released at all? Well, here's the kicker: the module in assert_uninstantiable/line:371 is instantiated with

_, err = r.InstantiateWithConfig(ctx, buf, wazero.NewModuleConfig())

but this flags the compiled module to be closed on a compilation failure:

mod, err = r.store.Instantiate(ctx, code.module, name, sysCtx, code.typeIDs)
if err != nil {
    // If there was an error, don't leak the compiled module.
    if code.closeWithModule {
       _ = code.Close(ctx) // don't overwrite the error
    }

But on close, the module gets deleted from the engine, and in turn

func (e *engine) deleteCompiledModule(module *wasm.Module) {
    e.mux.Lock()
    defer e.mux.Unlock()

    delete(e.codes, module.ID)

this removes the last actual reference from memory, making it eligible for a future collection.

So, the solution is not to use InstantiateWithConfig() and instead compile then instantiate. This won't auto-close the module, resolving the issue.

I am marking this as a draft because this indeed fixes the problem, but I have not written proper test for it yet.

Huge thanks to @achille-roussel and @ncruces for the help!! 🙏

EDIT meanwhile I added a proper reproducer, and we dug further into the issue. The real problem is that function objects hold a reference to a compiledCode that is independent from the one in compiledModule; so, when a compiledModule gets collected, the function ends up referencing an invalid memory address (no longer mmap'd). We tried putting the finalizer on *compiledCode instead, but this still won't work.

This might be the reason:

// FunctionInstanceReference implements the same method as documented on wasm.ModuleEngine.
func (e *moduleEngine) FunctionInstanceReference(funcIndex wasm.Index) wasm.Reference {
return uintptr(unsafe.Pointer(&e.functions[funcIndex]))
}

casting the unsafe.Pointer() to a uintptr (as @ncruces noted offline) makes it decay to a plain integer, so the GC will no longer account for it.

@evacchi evacchi changed the title spectest: ensure modules are not auto-closed on failure on assert_unlinkable spectest: ensure modules are not auto-closed on failure on assert_uninstantiable Feb 28, 2024
@mathetake
Copy link
Member

sorry, rather than fixing the spectest, I am strongly thinking that we should fix the lifecycle issue of engine's code in general -- e.g. holding the reference to the imported functions etc in the compiled module. anyway thank you for hunting down the root cause!

@evacchi
Copy link
Contributor Author

evacchi commented Feb 29, 2024

ok the root cause seems to be that compiledFunctions hold a reference to compiledCode but the finalizer is on compiledModule (which embeds *compiledCode); thus when the compiledModule is finalized, the function has a reference to the unmapped memory!

related #1608

Copy link
Collaborator

@achille-roussel achille-roussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing 🙌

We discussed on Slack and agreed that there was a reference problem, this is likely the fix.

Nice work 👏

Copy link
Collaborator

@ncruces ncruces left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, I missed your updated PR. 😔
This seems great. 👍
Sorry for the duplication!

@evacchi
Copy link
Contributor Author

evacchi commented Mar 1, 2024

Yeah sorry for not being clearer but it looks like this doesn't fix the reproducer, see the failing tests :/

@evacchi evacchi changed the title spectest: ensure modules are not auto-closed on failure on assert_uninstantiable compiler: lifecycle issue on finalizers with tables Mar 1, 2024
@evacchi
Copy link
Contributor Author

evacchi commented Mar 1, 2024

update: this might be the culprit with putting the finalizer on the *compiledCode in a904165

// FunctionInstanceReference implements the same method as documented on wasm.ModuleEngine.
func (e *moduleEngine) FunctionInstanceReference(funcIndex wasm.Index) wasm.Reference {
return uintptr(unsafe.Pointer(&e.functions[funcIndex]))
}

casting the unsafe.Pointer() to a uintptr (as @ncruces noted offline) makes it decay to a plain integer, so the GC will no longer account for it.

@evacchi
Copy link
Contributor Author

evacchi commented Mar 4, 2024

I followed @ncruces observation that setting Reference = unsafe.Pointer works; TBH this is not as invasive as it sounds because it makes 0 => nil and makes a little more explicit what really is happening; the added benefit is that unsafe.Pointers are actually tracked by the GC. Indeed this feels a bit invasive, but most changes are in the tests.

This is still rough at the edges (including some linting errors that I have intentionally left for now), I am pushing the commit just to give you an idea. Let me know what you think

EDIT: heh I also forgot about amd64. Well you get the idea.
EDIT2: done, also added some cosmetic change to show that this might not be as impactful as it looks; if anything in some places we have less casts and it is clearer that 0 == nilptr

tests have been blindly updated to pass, but they might be simplified / be useless now in some cases (?)

internal/engine/compiler/engine.go Outdated Show resolved Hide resolved
internal/engine/interpreter/interpreter.go Outdated Show resolved Hide resolved
internal/integration_test/spectest/spectest_test.go Outdated Show resolved Hide resolved
@evacchi
Copy link
Contributor Author

evacchi commented Mar 7, 2024

commit e9ad3fa temporarily reverts the patch to prove that the hammer test in e240147 demonstrates the issue in wazevo too. The solution seems to be, again, to use unsafe.Pointer ! Will restore e9ad3fa after the CI runs.

EDIT: proof https://github.com/tetratelabs/wazero/actions/runs/8183672859/job/22376887112

@evacchi
Copy link
Contributor Author

evacchi commented Mar 7, 2024

needs rebasing anyway due to #2130

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
…oid auto-close

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
…e` to avoid auto-close"

This reverts commit 5ddc150.

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
@evacchi
Copy link
Contributor Author

evacchi commented Mar 7, 2024

it does not look like wasm.Reference=unsafe.Pointer resolves the issue, in fact, it interacts in some ways with the race detector (as the original comments highlighted) and the hammer tests + -race seems to reliably fail even with the "fix" in place. Which, again, might be due to the race detector, or it might be because of a different issue.

At this time, I am not confident in this solution, but at least now we have a reproducer and we still know that it is a very specific edge case that we should be able to work around by deferring the closure of modules

[EDIT] more info: by reverting fde2d90 and 3458d18 the hammer test fails almost reliably, but with fde2d90 and 3458d18 it becomes just flakier, i.e. the current change doesn't seem to solve the issue completely and/or there is another issue at play.

@mathetake
Copy link
Member

I will take over this in a different PR!

@mathetake mathetake closed this Mar 8, 2024
@evacchi
Copy link
Contributor Author

evacchi commented Mar 8, 2024

thanks!

@mathetake mathetake deleted the finalizethis branch March 8, 2024 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flaky test: TestFileCache_{compiler, wazevo}
4 participants