Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack corruption caused by Fn call primitives #807

Merged
merged 1 commit into from Sep 18, 2020

Conversation

ruby0x1
Copy link
Member

@ruby0x1 ruby0x1 commented Sep 18, 2020

Excerpt from @munificent on the nature of the bug:

In runInterpreter, for performance, the vm caches an IP pointing into some bytecode.

All primitives except for .call do not touch Wren's own callstack. They run a little C code and return, so the array of CallFrames, their IPs, and the IP cached inside run() are not affected at all.

While runInterpreter() is running, the IP in the top CallFrame is not updated, so it gets out of sync. This is deliberate, since storing to a field is slow, but it means the value of that field is stale and doesn't represent where execution actually is at that point in time.

To get that field in sync, we use STORE_FRAME(), which stores the local IP value back into the IP field for the top CallFrame. The interpreter is careful to always call STORE_FRAME() before executing any code that pushes a new CallFrame onto the stack.

In particular, if you look around, you'll see that every place the interpreter calls wrenCallFunction() is preceded by a STORE_FRAME(). That is, except for the call to wrenCallFunction() in the call_fn() primitive. That's the bug.

The .call() method on Fn is special because it does modify the Wren call stack and the C code for that primitive directly calls wrenCallFunction(). When that happens, the correct IP for the current function, which lives only in runInterpreter()'s local variable gets discarded and you're left with a stale IP in the CallFrame.

Giving the function call primitives a different method type and having the case for that method type call STORE_FRAME() before invoking the primitive fixes the bug.


Benchmarks from this fix:
image

Excerpt from @munificent on the nature of the bug:

In runInterpreter, for performance, the vm caches an IP pointing into some bytecode.

All primitives except for `.call`, do not touch Wren's own callstack. They run a little C code and return, so the array of CallFrames, their IPs, and the IP cached inside run() are not affected at all.

While runInterpreter() is running, the IP in the top CallFrame is not updated, so it gets out of sync. This is deliberate, since storing to a field is slow, but it means the value of that field is stale and doesn't represent where execution actually is at that point in time.

To get that field in sync, we use STORE_FRAME(), which stores the local IP value back into the IP field for the top CallFrame. The interpreter is careful to always call STORE_FRAME() before executing any code that pushes a new CallFrame onto the stack.

In particular, if you look around, you'll see that every place the interpreter calls wrenCallFunction() is preceded by a STORE_FRAME(). That is, except for the call to wrenCallFunction() in the call_fn() primitive. That's the bug.

The .call() method on Fn is special because it does modify the Wren call stack and the C code for that primitive directly calls wrenCallFunction(). When that happens, the correct IP for the current function, which lives only in runInterpreter()'s local variable gets discarded and you're left with a stale IP in the CallFrame.

Giving the function call primitives a different method type and having the case for that method type call STORE_FRAME() before invoking the primitive fixes the bug.
@ruby0x1 ruby0x1 merged commit 86463ac into main Sep 18, 2020
@ruby0x1
Copy link
Member Author

ruby0x1 commented Sep 18, 2020

@mhermier feel free to take a look

@mhermier
Copy link
Contributor

This is basically a revert of 5f29a72, that fails tests because it forgets to check for runtime errors after calling method->as.primitive(vm, args);.
That change set should provides some of the earlier failing tests that were documented, to avoid future regressions.
With additional tests with all them passing, and performance numbers should be re-advertised, it is very likely to be good enough for merge for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants