Optimizations #98
Conversation
Really fantastic work here. I think it should be looked over by someone else too, since I haven't spent much time on this code and I'm in a bit of a rush, but I've got a few points for you to look at.
src/common/stack.rs
Outdated
self.values
    .back()
    .get(len - 1)
This should be .last and .last_mut
Lol definitely!
src/isa.rs
Outdated
/// Last one is the default.
///
/// Can be less than zero.
BrTable(Box<[Target]>),
I dislike the pointer indirection here; it's not important to fix now, but I wonder if we can do something about it. We could have BrTable have one field (the length) and then have a series of Br instructions afterwards that act as the list of targets.
#100 !
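The inline-targets idea could be sketched roughly like this; the names (Instr, resolve_target) and the u32 payloads are hypothetical, not wasmi's actual ISA. BrTable carries only a target count, and the count Br instructions that follow it in the stream act as the target list, with the last one as the default:

```rust
// Hypothetical flattened encoding: `BrTable` carries only a target count,
// and the `count` instructions following it are the targets (last = default).
#[derive(Clone, Copy)]
enum Instr {
    BrTable { count: u32 },
    Br { dst: u32 },
}

// Resolve a dynamic index against the inline target list; out-of-range
// indices clamp to the last (default) target, as in Wasm's br_table.
fn resolve_target(code: &[Instr], pc: usize, index: u32) -> u32 {
    let count = match code[pc] {
        Instr::BrTable { count } => count,
        _ => panic!("expected BrTable"),
    };
    let slot = index.min(count - 1) as usize;
    match code[pc + 1 + slot] {
        Instr::Br { dst } => dst,
        _ => panic!("malformed target list"),
    }
}
```

No pointer chase is needed: the targets live contiguously in the same instruction stream the interpreter is already walking.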
/// Return from current function block.
Return,
Return(isa::DropKeep),
Can the drop/keep instructions be encoded as extra instructions at the end of the function (so that we don't have to have special-case handling)? That would mean we can make Branch and Return the same here.
What kind of special-case handling do you mean? Also, btw, drop/keep are not instructions per se, but immediate arguments of these instructions.
Special-case handling of returning vs branching. Especially in a stack machine you shouldn't need to have a difference.
Oh wow, that's an interesting idea; I hadn't thought about it. It requires some substantial changes, though.
Firstly, all branches are currently static, i.e. they encode the branch destination (and drop/keep) as immediates. So we need an instruction that pops a return address and branches to it. (It seems to be the only instruction that needs this kind of indirect jump, so why not keep calling it return? : ) )
Secondly, all code is currently contained in separate vectors, so you can't branch from one function and land in the code of another. Keeping all code in a single vec would be more efficient and is desirable, but I didn't go that way since it might be more error-prone, and decided to leave it for later.
Also, when I thought about making drop/keep separate instructions, I decided not to, because a large part of an interpreter's overhead comes from dispatch. So it seems profitable to fuse several instructions into one.
When this gets merged I'll write a PR to try to convert it to use a flat vector. We can even generate the bytecode for each individual function in parallel and then flatten it at the end.
You've convinced me that this is fine to keep as a special case, though.
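For readers following along, the drop/keep immediate under discussion can be sketched like this (element type simplified to u64 and names illustrative, not wasmi's actual code): keep the top `keep` values and remove the `drop` values buried beneath them.

```rust
// Sketch of drop/keep as an immediate argument (not separate instructions):
// the kept values slide down over the dropped region, then the stack shrinks.
struct DropKeep {
    drop: u32,
    keep: u32,
}

fn apply_drop_keep(stack: &mut Vec<u64>, dk: DropKeep) {
    let len = stack.len();
    let keep = dk.keep as usize;
    let drop = dk.drop as usize;
    // Move each kept value down over the `drop` slots beneath it.
    for i in 0..keep {
        stack[len - keep - drop + i] = stack[len - keep + i];
    }
    stack.truncate(len - drop);
}
```

Fusing this into the branch/return instruction avoids an extra dispatch per use, which is the trade-off discussed above.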
let mut function_stack = VecDeque::new();
function_stack.push_back(context);
let mut call_stack = Vec::new();
Can we cache this vec in self so that we don't need to realloc on every call to start_execution? It's especially useful because we start empty: we'd just call Vec::clear after run_interpreter_loop.
That will not make any difference since, for now, an Interpreter is disposable. That is, every call to FuncInvoke::invoke will create a new instance of the Interpreter and then drop it after it finishes the execution.
Thankfully, this happens only once when invoking a wasm module's exported function, which should be pretty infrequent.
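For illustration, the suggested reuse pattern would look roughly like this (hypothetical field and element types); as noted above, it only pays off if the Interpreter instance itself outlives a single invocation:

```rust
// Sketch of caching the call stack across calls: clear() resets the length
// but keeps the allocation for the next run.
struct Interpreter {
    call_stack: Vec<u32>, // element type is illustrative
}

impl Interpreter {
    fn start_execution(&mut self, entry: u32) {
        self.call_stack.push(entry);
        // ... the interpreter loop would run here ...
        self.call_stack.clear(); // keep capacity, drop contents
    }
}
```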
src/runner.rs
Outdated
#[inline(always)]
fn run_instruction(&mut self, context: &mut FunctionContext, instruction: &isa::Instruction) -> Result<InstructionOutcome, TrapKind> {
    match instruction {
        &isa::Instruction::Unreachable => self.run_unreachable(context),
Are these run_* functions inlined by LLVM? That'd be a really good source of easy wins if not.
Not sure about all of them, but I experimented with inlining hot functions (such as set_local/get_local for fac-opt) and it didn't make a difference.
@@ -1252,65 +1108,92 @@ pub fn check_function_args(signature: &Signature, args: &[RuntimeValue]) -> Resu
    Ok(())
}

#[derive(Debug)]
struct ValueStack {
Why Box<[RuntimeValue]> + usize and not Vec? This seems basically equivalent to a Vec.
That's actually for a reason. This way it's more explicit that we don't use Vec, and in particular that we are trying to avoid the Vec::push method.
I feel like that's unnecessary and just leads to having to write many 0s into the Box<[RuntimeValue]>. If we used Vec we could use with_capacity and have the extra space be transparently lazily allocated by jemalloc.
I feel like that's unnecessary

What exactly? Do you mean avoiding using Vec::push?
AFAIR, using Vec in this case actually slows things down quite a lot, but I'm not sure.

just leads to having to write many 0s into the Box<[RuntimeValue]>

But if we use Vec we still need to use resize, which is basically the same?
Why do we need to use resize instead of with_capacity? Surely one of the validation steps of the bytecode is to make sure that it doesn't read uninitialised stack data.
If we use with_capacity only and leave len unchanged, then we basically can't use get, right?
Hm, I think I lost it somewhere and don't follow...
I'm thinking something like this. It makes the code a lot simpler and jemalloc can lazily allocate the space for the values:
#[derive(Debug)]
struct ValueStack {
    buf: Vec<RuntimeValue>,
}

impl ValueStack {
    fn with_limit(limit: usize) -> ValueStack {
        let buf = Vec::with_capacity(limit);
        ValueStack { buf: buf }
    }

    #[inline]
    fn drop_keep(&mut self, drop_keep: isa::DropKeep) {
        if drop_keep.keep == isa::Keep::Single {
            let top = *self.top();
            *self.pick_mut(drop_keep.drop as usize + 1) = top;
        }

        let cur_stack_len = self.len();
        self.buf.truncate(cur_stack_len - drop_keep.drop as usize);
    }

    #[inline]
    fn pop_as<T>(&mut self) -> T
    where
        T: FromRuntimeValue,
    {
        let value = self.pop();
        value
            .try_into()
            .expect("Due to validation stack top's type should match")
    }

    #[inline]
    fn pop_pair_as<T>(&mut self) -> (T, T)
    where
        T: FromRuntimeValue,
    {
        let right = self.pop_as();
        let left = self.pop_as();
        (left, right)
    }

    #[inline]
    fn pop_triple(&mut self) -> (RuntimeValue, RuntimeValue, RuntimeValue) {
        let right = self.pop();
        let mid = self.pop();
        let left = self.pop();
        (left, mid, right)
    }

    #[inline]
    fn top(&self) -> &RuntimeValue {
        self.pick(1)
    }

    fn pick(&self, depth: usize) -> &RuntimeValue {
        &self.buf[self.buf.len() - depth]
    }

    #[inline]
    fn pick_mut(&mut self, depth: usize) -> &mut RuntimeValue {
        let old_len = self.buf.len();
        &mut self.buf[old_len - depth]
    }

    #[inline]
    fn pop(&mut self) -> RuntimeValue {
        // TODO: Use `get_unchecked` since we always have at least one value
        self.buf.pop().unwrap()
    }

    #[inline]
    fn push(&mut self, value: RuntimeValue) -> Result<(), TrapKind> {
        if self.buf.len() == self.buf.capacity() {
            return Err(TrapKind::StackOverflow);
        }
        self.buf.push(value);
        Ok(())
    }

    #[inline]
    fn len(&self) -> usize {
        self.buf.len()
    }
}
I just re-ran the benchmarks with this version, and they show that it is indeed slower (though not as much slower as I expected).
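For comparison, the Box<[T]> + usize shape that benchmarked faster can be sketched as follows (element type simplified to u64; FixedStack is an illustrative name, not the actual ValueStack):

```rust
// Fixed-capacity stack: the buffer is allocated once up front, and push is
// a bounds check plus an indexed store -- never a reallocating Vec::push.
struct FixedStack {
    buf: Box<[u64]>,
    len: usize,
}

impl FixedStack {
    fn with_limit(limit: usize) -> FixedStack {
        FixedStack {
            buf: vec![0; limit].into_boxed_slice(),
            len: 0,
        }
    }

    fn push(&mut self, value: u64) -> Result<(), &'static str> {
        if self.len == self.buf.len() {
            return Err("stack overflow");
        }
        self.buf[self.len] = value;
        self.len += 1;
        Ok(())
    }

    fn pop(&mut self) -> u64 {
        self.len -= 1;
        self.buf[self.len]
    }
}
```

The capacity check doubles as the stack-overflow trap, which is part of why the explicit layout is attractive despite the up-front zero writes.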
context.sink.emit(isa::Instruction::SetGlobal(index));
}

I32Load(align, offset) => {
Could you generate this with a macro? It seems error-prone.
Yeah, you're right, but I would rather not touch this code! All the instructions that happened to be placed after TeeLocal were mechanically and carefully edited with the help of the multiline cursor feature :D
If this code ever needs to be changed we have to convert it to be a macro. Will tests fail if we mess it up?
If this code ever needs to be changed we have to convert it to be a macro.
Agree.
Will tests fail if we mess it up?
I guess so
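A macro along these lines could generate the repetitive arms; everything here (Op, mnemonic, impl_load_name) is a hypothetical sketch, not code from this PR:

```rust
// Generate one match arm per listed variant instead of writing them by hand,
// so adding an instruction can't silently miss an arm.
macro_rules! impl_load_name {
    ($($variant:ident),*) => {
        fn mnemonic(op: &Op) -> &'static str {
            match op {
                $(Op::$variant { .. } => stringify!($variant),)*
            }
        }
    };
}

enum Op {
    I32Load { align: u32, offset: u32 },
    I64Load { align: u32, offset: u32 },
}

impl_load_name!(I32Load, I64Load);
```

The same repetition pattern would work for emitting the compile arms themselves, not just names; mismatched variants then become compile errors rather than runtime bugs the tests may or may not catch.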
src/validation/util.rs
Outdated
acc = acc
    .checked_add(locals_group.count())
    .ok_or_else(||
        Error(String::from("Locals range no in 32-bit range"))
Typo: should say "not"
Thanks!
}

impl ValueStack {
    fn with_limit(limit: usize) -> ValueStack {
        let mut buf = Vec::new();
        buf.resize(limit, RuntimeValue::I32(0));
If we unify the value types as in #99 we can make this use calloc
Yeah, that's the plan! (And because of this I've asked you about zero-ing FPs the other day : ) )
Great! I don't know whether it will work with a union, but we can do Vec::<u64>::new() and then convert that to a Vec<RuntimeValue>. That'll definitely use calloc.
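One note on the zeroed-allocation idea: the standard library specializes zero-filled Vec construction, so the zeroed u64 buffer itself needs no per-element writes. This is a sketch of just that first step; converting the buffer into a union-based RuntimeValue vector would be a separate unsafe step not shown here.

```rust
// `vec![0u64; n]` is specialized to a zeroing allocation (calloc on common
// allocators), so the elements are never written individually.
fn zeroed_stack(limit: usize) -> Vec<u64> {
    vec![0u64; limit]
}
```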
@@ -7,3 +7,6 @@ authors = ["Sergey Pepyakin <s.pepyakin@gmail.com>"]
wasmi = { path = ".." }
assert_matches = "1.2"
wabt = "0.3"

[profile.bench]
debug = true
why?
For cachegrind, although this probably shouldn't be committed.
Yeah, or Instruments in my case : )
src/runner.rs
Outdated
    .pop_as();
Ok(InstructionOutcome::Branch(table.get(index as usize).cloned().unwrap_or(default) as usize))

let dst =
strange indentation below
Maybe we can use rustfmt from here on? It would have to start with a new PR.
I'd like to!
src/runner.rs
Outdated
let table_func_idx: u32 = context
    .value_stack_mut()
let table_func_idx: u32 = self
    .value_stack
new line for this is overkill?
src/runner.rs
Outdated
    .value_stack_mut()
fn run_drop(&mut self) -> Result<InstructionOutcome, TrapKind> {
    let _ = self
        .value_stack
ditto too many lines
    .expect("Due to validation stack should contain pair of values");
let (left, right) = self
    .value_stack
    .pop_pair_as::<T>();
is ::<T> needed here?
In this particular place, yes, it is needed. Rustc has problems with inferring the types.
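A minimal illustration of the inference problem, with simplified types (this pop_pair_as is a hypothetical stand-in for the real method): nothing else in the statement pins down T, so the call site needs the turbofish.

```rust
use std::convert::TryFrom;

// Generic pop: T is determined only by the caller, not by the arguments.
fn pop_pair_as<T>(stack: &mut Vec<u64>) -> (T, T)
where
    T: TryFrom<u64>,
    <T as TryFrom<u64>>::Error: std::fmt::Debug,
{
    let right = T::try_from(stack.pop().unwrap()).unwrap();
    let left = T::try_from(stack.pop().unwrap()).unwrap();
    (left, right)
}
```

Here `let (left, right) = pop_pair_as::<u32>(&mut stack);` compiles, while omitting `::<u32>` fails unless the results are later used in a way that fixes T.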
src/validation/func.rs
Outdated
let drop_keep = drop_keep_return(
    &context.locals,
    &context.value_stack,
    &context.frame_stack
missing trailing comma ;)
🎉
Based on #97
Wasm isn't the greatest instruction set for interpretation; it is optimized more for ease of validation and compilation.
The central theme of this PR is a transformation of Wasm's structured stack machine into a plain stack machine. This transformation is performed once, before interpretation, alongside validation.
See "isa.rs" for more details on this topic.
Beyond this, there are a couple of optimizations here and there.
With all of this, the overall performance increase is almost 2x on some benchmarks.
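To make the central idea concrete under much-simplified, hypothetical types (this is not wasmi's actual ISA): structured block/end control flow is compiled into plain branches whose destinations are already resolved, so the interpreter loop only ever moves a program counter.

```rust
// A toy flat stack machine: `Br` and `BrIfEqz` jump to pre-resolved
// instruction indices; no structured blocks remain at run time.
#[derive(Clone, Copy)]
enum FlatInstr {
    Const { value: i32 },   // push an immediate
    Br { dst: usize },      // unconditional jump
    BrIfEqz { dst: usize }, // jump if the popped value is zero
    End,                    // return the stack top
}

fn run(code: &[FlatInstr]) -> Option<i32> {
    let mut stack: Vec<i32> = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            FlatInstr::Const { value } => stack.push(value),
            FlatInstr::Br { dst } => {
                pc = dst;
                continue;
            }
            FlatInstr::BrIfEqz { dst } => {
                if stack.pop()? == 0 {
                    pc = dst;
                    continue;
                }
            }
            FlatInstr::End => return stack.pop(),
        }
        pc += 1;
    }
}
```

An if/else that would be nested blocks in Wasm becomes the straight-line sequence [Const 0, BrIfEqz{dst: 4}, Const 1, Br{dst: 5}, Const 2, End]: the validator resolves the destinations once, and the hot loop never inspects block structure.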