Trapping sporadically segfaults #70

Closed
dhil opened this issue Dec 30, 2023 · 4 comments
Labels: bug (Something isn't working)

Comments

dhil commented Dec 30, 2023

The test unhandled.wast occasionally triggers a segfault in the testsuite, e.g.

Caused by:
  process didn't exit successfully: `/home/dhil/projects/wasmfx/wasmtime/target/debug/deps/all-2197ac4314cc5afd` (signal: 11, SIGSEGV: invalid memory reference)

We should investigate the cause of this segfault. It appears to be related to trap generation and, potentially, to how traps propagate across multiple linked stacks.

dhil added the bug label Dec 30, 2023

dhil commented Dec 30, 2023

Potentially relevant for PR #58.

frank-emrich commented

I've investigated this. The problem is that trace_through_wasm in backtrace.rs breaks due to how the trampolines we use to run fibers overwrite some data in VMRuntimeLimits.

The idea of trace_through_wasm is to follow the chain of frame pointers until it hits the stack frame where execution of wasm code originally began, identified by trampoline_sp. Concretely, the function assumes that trampoline_sp is the stack pointer of the trampoline in which wasm execution started, and that the currently running wasm frames sit immediately below it.
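
The walk itself is simple in principle. Below is a minimal sketch of it, assuming the conventional frame layout where the caller's saved frame pointer lives at offset 0 of the current frame and the return address one word above it; the function and parameter names are placeholders, not Wasmtime's actual API.

```rust
// Hedged sketch of the frame-pointer walk; not Wasmtime's real code.
unsafe fn walk_wasm_frames(
    mut fp: usize,
    trampoline_sp: usize,
    mut visit_frame: impl FnMut(usize, usize), // (pc, fp)
) {
    // Wasm frames are assumed to sit immediately below the stack pointer of
    // the trampoline that entered wasm, so stop once we reach it.
    while fp < trampoline_sp {
        unsafe {
            // The return address is saved one word above the frame pointer.
            let pc = *((fp + std::mem::size_of::<usize>()) as *const usize);
            visit_frame(pc, fp);
            // Follow the saved frame pointer up to the caller's frame.
            fp = *(fp as *const usize);
        }
    }
}
```

If trampoline_sp does not actually bound the frames being walked, this loop keeps chasing pointers into memory it has no business reading, which is consistent with the sporadic SIGSEGV reported above.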

The value of trampoline_sp is obtained from the last_wasm_entry_sp field of VMRuntimeLimits. Unfortunately, this means that whenever we start executing a continuation with resume, the array-call trampoline we use overwrites this value with the current stack pointer. In other words, last_wasm_entry_sp may point into the last fiber on which we started executing a function as a continuation, which can be entirely unrelated to whatever fiber has been switched to in the meantime.
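
To make the overwrite concrete, here is a small, self-contained illustration; RuntimeLimits and enter_wasm are hypothetical stand-ins for VMRuntimeLimits and the trampoline's bookkeeping, and the addresses are made up.

```rust
// Hypothetical stand-in for the relevant piece of VMRuntimeLimits.
struct RuntimeLimits {
    // Overwritten on every entry into wasm, including entries performed when a
    // continuation is resumed on a fresh fiber stack.
    last_wasm_entry_sp: usize,
}

// Stand-in for what the array-call trampoline does on each host-to-wasm entry.
fn enter_wasm(limits: &mut RuntimeLimits, current_sp: usize) {
    limits.last_wasm_entry_sp = current_sp;
}

fn main() {
    let mut limits = RuntimeLimits { last_wasm_entry_sp: 0 };

    // Initial host-to-wasm transition on the main stack.
    enter_wasm(&mut limits, 0x7fff_0000);

    // Later, `resume` runs a continuation on a fiber stack; the same trampoline
    // fires again and clobbers the main-stack entry point.
    enter_wasm(&mut limits, 0x2000_0000);

    // The backtrace walk now uses a fiber-stack address as trampoline_sp, which
    // has nothing to do with the stack that is active when a trap occurs.
    assert_eq!(limits.last_wasm_entry_sp, 0x2000_0000);
}
```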

Unfortunately, I don't see a simple way to prevent the last_wasm_entry_sp field from being overwritten: the overwrite happens in existing trampolines that we re-use and cannot easily change. With last_wasm_entry_sp clobbered, we don't know where the wasm stack frames end once the backtrace reaches the main stack.

We could probably use the chain of ContinuationObject pointers in the VMContext to detect such situations and construct a backtrace for all frames in the chain of ContinuationObjects. That backtrace would be incomplete, though, because we cannot include frames from the main stack: to detect when we reach the outermost wasm frame on the main stack, we would need access to the "original" value of last_wasm_entry_sp (i.e., its value when we first switched from the host into wasm code to start execution).
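
As a sketch of that partial-backtrace idea, and only under the assumption that each continuation records its parent plus the frame-pointer range of its own stack, it could look roughly like this; parent, last_fp, and stack_base are hypothetical fields, not the in-tree ContinuationObject layout.

```rust
// Hypothetical continuation layout for illustration; not the real ContinuationObject.
struct ContinuationObject {
    parent: *mut ContinuationObject,
    // Most recent wasm frame pointer on this continuation's stack.
    last_fp: usize,
    // Upper end of this continuation's stack; frame walking stops here.
    stack_base: usize,
}

unsafe fn trace_continuation_chain(
    mut cont: *mut ContinuationObject,
    mut visit_frame: impl FnMut(usize),
) {
    // Walk the frames of each continuation's stack, then hop to its parent.
    // Frames on the main stack cannot be included, because without the original
    // last_wasm_entry_sp we do not know where they end.
    while !cont.is_null() {
        unsafe {
            let mut fp = (*cont).last_fp;
            while fp != 0 && fp < (*cont).stack_base {
                visit_frame(fp);
                fp = *(fp as *const usize);
            }
            cont = (*cont).parent;
        }
    }
}
```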

There is some logic in traphandlers.rs to model (what I think is) a chain of stacks used by the existing async implementation, but it's probably a bad idea to try to use that for our purposes (in particular, we would have to keep the parent pointers in yet another chain of stacks up to date when switching stacks).

dhil commented Jan 5, 2024

Thanks for diagnosing this issue! I do recall thinking about trace_through_wasm for linked stacks in the past.

It sounds like you are suggesting we implement a bespoke backtracing mechanism rather than trying to shoehorn our case into, or piggyback on, the existing infrastructure. I agree with that sentiment at this stage.

I don't think it will be terribly difficult to implement. The key is to carefully record the last entry pointers, as you suggest; we could attach them to the continuation headers. I'm inclined to believe the backtracing metadata should only be available in debug mode.
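
A rough sketch of what that might look like, assuming each continuation header can carry its own entry stack pointer; the names and the cfg(debug_assertions) gating are guesses at the shape, not what was actually implemented.

```rust
// Hypothetical continuation header carrying per-stack backtrace metadata; the
// field names and the debug-only gating are assumptions for illustration.
struct ContinuationHeader {
    parent: *mut ContinuationHeader,
    // Stack pointer recorded when wasm execution enters this continuation's
    // stack; plays the role of trampoline_sp, but per stack, so a backtrace
    // knows where this stack's wasm frames end.
    #[cfg(debug_assertions)]
    last_wasm_entry_sp: usize,
}

impl ContinuationHeader {
    // Would be called when `resume` switches to this continuation, instead of
    // relying solely on the global VMRuntimeLimits field.
    #[cfg(debug_assertions)]
    fn record_entry(&mut self, sp: usize) {
        self.last_wasm_entry_sp = sp;
    }
}
```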

dhil commented May 14, 2024

Fixed in a previous PR by @frank-emrich.

dhil closed this as completed May 14, 2024