
Optimizations #98

Merged: 35 commits merged into master on Jul 4, 2018

Conversation

@pepyakin (Collaborator) commented Jun 18, 2018:

Based on #97

Wasm isn't the greatest instruction set for interpretation: it is optimized for ease of validation and compilation rather than for fast dispatch.

The central theme of this PR is a transformation of Wasm's structured stack machine into a plain stack machine. This transformation is performed once, before interpretation, alongside validation.

See "isa.rs" for more details on this topic.

Beyond this, there are a couple of optimizations here and there.

With all of this, the overall increase in performance is almost 2x on some benchmarks.
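The structured-to-flat lowering described above can be pictured with a toy sketch (hypothetical names and a drastically simplified instruction set, not the real scheme in isa.rs, which also handles drop/keep, loops, and more): branches that target a label by relative depth are rewritten into jumps to absolute program counters, back-patched once the block's end offset is known.

```rust
// Toy sketch of the structured-to-flat lowering (hypothetical names, not
// wasmi's isa.rs): branches targeting a label by relative depth become
// jumps to absolute program counters, back-patched at the block's end.
#[derive(Debug, PartialEq)]
enum StructuredOp {
    Block(Vec<StructuredOp>), // a labeled region; branches target its end
    BrIfDepth(u32),           // branch to the end of the n-th enclosing block
    Nop,
}

#[derive(Debug, PartialEq)]
enum FlatOp {
    BrIf(usize), // absolute pc of the jump target
    Nop,
}

fn lower_block(ops: &[StructuredOp], out: &mut Vec<FlatOp>, pending: &mut Vec<Vec<usize>>) {
    for op in ops {
        match op {
            StructuredOp::Nop => out.push(FlatOp::Nop),
            StructuredOp::BrIfDepth(depth) => {
                // Target pc is not known yet: emit a placeholder and record
                // its position under the matching enclosing block.
                let at = out.len();
                out.push(FlatOp::BrIf(usize::MAX));
                let block = pending.len() - 1 - *depth as usize;
                pending[block].push(at);
            }
            StructuredOp::Block(body) => {
                pending.push(Vec::new());
                lower_block(body, out, pending);
                // The block's end is now known: patch every recorded branch.
                let end = out.len();
                for at in pending.pop().unwrap() {
                    out[at] = FlatOp::BrIf(end);
                }
            }
        }
    }
}
```

After lowering, the interpreter loop only ever moves a program counter; no label stack is needed at run time.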

@eira-fransham (Contributor) left a comment:

Really fantastic work here. I think it should be looked over by someone else too, since I've not spent much time on this code and I'm in a bit of a rush, but I've got a few points for you to look at.

self.values
.back()
.get(len - 1)
Contributor:

This should be .last and .last_mut
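For reference, the suggested slice accessors next to the manual indexing they would replace (a minimal illustration, not the PR's code):

```rust
// The suggested accessors, next to the manual indexing they replace.
fn top_of(values: &[i32]) -> Option<&i32> {
    // values.get(values.len() - 1) underflows on an empty slice before
    // `get` can even return None; `last` handles emptiness for free.
    values.last()
}

fn set_top(values: &mut [i32], v: i32) {
    // `last_mut` gives mutable access to the top of the stack.
    if let Some(top) = values.last_mut() {
        *top = v;
    }
}
```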

Collaborator (author):

Lol definitely!

src/isa.rs Outdated
/// Last one is the default.
///
/// Can be less than zero.
BrTable(Box<[Target]>),
Contributor:

I dislike the pointer indirection here, it's not important to fix now but I wonder if we can do something about it. We could have BrTable have one field (the length) and then have a series of Br instructions afterwards that act as the list of targets.
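The inline-encoding idea could look roughly like this (an illustrative sketch, not wasmi's actual encoding): a header instruction carrying the length, followed by one target entry per arm in the instruction stream, so resolving a branch is plain indexing with no pointer chasing.

```rust
// Illustrative sketch (not wasmi's actual encoding): branch-table targets
// stored inline in the instruction stream instead of behind a
// Box<[Target]>, so resolving a branch needs no pointer indirection.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Instr {
    BrTable { count: u32 }, // header; followed by `count` BrTableTarget entries
    BrTableTarget(u32),     // jump destination (pc); the last one is the default
    Nop,
}

fn br_table_target(code: &[Instr], pc: usize, index: u32) -> u32 {
    let count = match code[pc] {
        Instr::BrTable { count } => count,
        _ => panic!("pc must point at a BrTable header"),
    };
    // Out-of-range indices fall through to the last (default) target.
    let slot = index.min(count - 1) as usize;
    match code[pc + 1 + slot] {
        Instr::BrTableTarget(dest) => dest,
        _ => panic!("malformed inline branch table"),
    }
}
```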

Collaborator (author):

#100 !

/// Return from current function block.
Return,
Return(isa::DropKeep),
Contributor:

Can the drop/keep instructions be encoded as extra instructions at the end of the function (so that we don't have to have special-case handling)? That would mean we can make Branch and Return the same here.

Collaborator (author):

What kind of special-case handling do you mean? Also, btw, drop/keep are not instructions per se, but immediate arguments of these instructions.

Contributor:

Special case handling of returning vs branching. Especially in a stack machine you shouldn't need to have a difference.

Collaborator (author):

Oh wow, that's an interesting idea; I hadn't thought about it. It requires some substantial changes, though.

Firstly, all branches are currently static, i.e. they encode the branch destination (and drop/keep) as immediates. So we would need an instruction that pops a return address and branches to it. (It seems that's the only instruction that needs this kind of indirect jump, so why not keep calling it return? : ) )

Secondly, all code is currently contained in separate vectors, so you can't branch from one function and land in the code of another. Keeping all code in a single vec would be more efficient and desirable, but I didn't go that way since it might be more error-prone, and decided to leave it for later.

Also, when I thought about making drop/keep separate instructions, I decided not to, because a large part of an interpreter's overhead comes from dispatch. So it seems profitable to fuse several instructions into one.
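The fused drop/keep immediate discussed here can be pictured as follows (a standalone sketch with hypothetical types, not wasmi's code): on a branch or return, the kept result is copied down over the dropped slots and the stack is truncated, all inside the one dispatched instruction.

```rust
// Standalone sketch of the drop/keep immediate (hypothetical types, not
// wasmi's code): on branch/return, optionally copy the kept top value
// down over the dropped slots, then truncate. Fusing this into the
// branch/return instruction saves one dispatch per branch.
#[derive(Clone, Copy)]
enum Keep {
    None,
    Single,
}

#[derive(Clone, Copy)]
struct DropKeep {
    drop: u32,
    keep: Keep,
}

fn apply_drop_keep(stack: &mut Vec<u64>, dk: DropKeep) {
    if let Keep::Single = dk.keep {
        // Move the kept result below the values about to be dropped.
        let top = *stack.last().expect("validation ensures a result is present");
        let len = stack.len();
        stack[len - 1 - dk.drop as usize] = top;
    }
    let new_len = stack.len() - dk.drop as usize;
    stack.truncate(new_len);
}
```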

Contributor:

When this gets merged I'll write a PR to try to convert it to use a flat vector. We can even generate the bytecode for each individual function in parallel and then flatten it at the end.

You've convinced me that this is fine to keep as a special case, though.


let mut function_stack = VecDeque::new();
function_stack.push_back(context);
let mut call_stack = Vec::new();
Contributor:

Can we cache this vec in self so that we don't need to realloc every call to start_execution? Especially useful because we start empty. We just call Vec::clear after run_interpreter_loop.
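The reuse pattern being suggested would look something like this (a minimal sketch with a hypothetical Interpreter struct, not wasmi's): the vector lives in the struct, and a clear after the interpreter loop resets the length while retaining the allocation for the next call.

```rust
// Minimal sketch of the suggested reuse (hypothetical Interpreter struct):
// the call stack lives in `self`, and `clear` after the interpreter loop
// resets the length while retaining the allocation for the next call.
struct Interpreter {
    call_stack: Vec<u32>, // stand-in for real call-stack frames
}

impl Interpreter {
    fn new() -> Interpreter {
        Interpreter { call_stack: Vec::new() }
    }

    fn start_execution(&mut self, depth: u32) -> usize {
        for frame in 0..depth {
            self.call_stack.push(frame);
        }
        let max_depth = self.call_stack.len();
        // The equivalent of calling Vec::clear after run_interpreter_loop:
        // length goes to 0, capacity is kept.
        self.call_stack.clear();
        max_depth
    }
}
```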

Collaborator (author):

That will not make any difference since, for now, an Interpreter is disposable. That is, every call to FuncInvoke::invoke creates a new instance of the Interpreter and then drops it after execution finishes.

Thankfully, this happens only once per invocation of a wasm module's exported function, which should be pretty infrequent.

src/runner.rs Outdated
#[inline(always)]
fn run_instruction(&mut self, context: &mut FunctionContext, instruction: &isa::Instruction) -> Result<InstructionOutcome, TrapKind> {
match instruction {
&isa::Instruction::Unreachable => self.run_unreachable(context),
Contributor:

Are these run_* functions inlined by LLVM? That'd be a really good source of easy wins if not.

Collaborator (author):

Not sure about all of them, but I experimented with inlining of hot functions (such as set_local/get_local for fac-opt) and it didn't make a difference.

@@ -1252,65 +1108,92 @@ pub fn check_function_args(signature: &Signature, args: &[RuntimeValue]) -> Resu
Ok(())
}

#[derive(Debug)]
struct ValueStack {
Contributor:

Why Box<[RuntimeValue]> + usize and not Vec? This seems basically equivalent to a Vec.

Collaborator (author):

That's actually for a reason. This way it's more explicit that we don't use Vec and, in particular, that we are trying to avoid the Vec::push method.

Contributor:

I feel like that's unnecessary and just leads to having to write many 0s into the Box<[RuntimeValue]>. If we used Vec we could use with_capacity and have the extra space be transparently lazily allocated by jemalloc.

@pepyakin (Collaborator, author) commented Jun 21, 2018:

I feel like that's unnecessary

What exactly? Do you mean avoiding Vec::push?
AFAIR, using Vec in this case actually slows things down quite a lot, but I'm not sure.

just leads to having to write many 0s into the Box<[RuntimeValue]>

But if we use Vec we still need to use resize, which is basically the same?

Contributor:

Why do we need to use resize instead of with_capacity? Surely one of the validation steps of the bytecode is to make sure that it doesn't read uninitialised stack data.

Collaborator (author):

If we use with_capacity only and leave len unchanged, then we basically can't use get, right?

Hm, I think I lost it somewhere and don't follow...

Contributor:

I'm thinking something like this. It makes the code a lot simpler and jemalloc can lazily allocate the space for the values:

#[derive(Debug)]
struct ValueStack {
    buf: Vec<RuntimeValue>,
}

impl ValueStack {
    fn with_limit(limit: usize) -> ValueStack {
        let buf = Vec::with_capacity(limit);

        ValueStack { buf: buf }
    }

    #[inline]
    fn drop_keep(&mut self, drop_keep: isa::DropKeep) {
        if drop_keep.keep == isa::Keep::Single {
            let top = *self.top();
            *self.pick_mut(drop_keep.drop as usize + 1) = top;
        }

        let cur_stack_len = self.len();
        self.buf.truncate(cur_stack_len - drop_keep.drop as usize);
    }

    #[inline]
    fn pop_as<T>(&mut self) -> T
    where
        T: FromRuntimeValue,
    {
        let value = self.pop();
        value
            .try_into()
            .expect("Due to validation stack top's type should match")
    }

    #[inline]
    fn pop_pair_as<T>(&mut self) -> (T, T)
    where
        T: FromRuntimeValue,
    {
        let right = self.pop_as();
        let left = self.pop_as();
        (left, right)
    }

    #[inline]
    fn pop_triple(&mut self) -> (RuntimeValue, RuntimeValue, RuntimeValue) {
        let right = self.pop();
        let mid = self.pop();
        let left = self.pop();
        (left, mid, right)
    }

    #[inline]
    fn top(&self) -> &RuntimeValue {
        self.pick(1)
    }

    fn pick(&self, depth: usize) -> &RuntimeValue {
        &self.buf[self.buf.len() - depth]
    }

    #[inline]
    fn pick_mut(&mut self, depth: usize) -> &mut RuntimeValue {
        let old_len = self.buf.len();
        &mut self.buf[old_len - depth]
    }

    #[inline]
    fn pop(&mut self) -> RuntimeValue {
        // TODO: Use `get_unchecked` since we always have at least one value
        self.buf.pop().unwrap()
    }

    #[inline]
    fn push(&mut self, value: RuntimeValue) -> Result<(), TrapKind> {
        if self.buf.len() == self.buf.capacity() {
            return Err(TrapKind::StackOverflow);
        }
        self.buf.push(value);
        Ok(())
    }

    #[inline]
    fn len(&self) -> usize {
        self.buf.len()
    }
}

Collaborator (author):

I just re-ran the benchmarks with this version, and they show that it is indeed slower (though not by as much as I expected).

context.sink.emit(isa::Instruction::SetGlobal(index));
}

I32Load(align, offset) => {
Contributor:

Could you generate this with a macro? It seems error-prone.
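A macro along the suggested lines might look like this sketch (illustrative names and a tiny instruction set, not wasmi's actual code): one macro_rules! invocation stamps out an identical match arm per load variant, so the variants can't silently diverge.

```rust
// Hypothetical sketch of generating repetitive per-instruction match arms
// with macro_rules!, as suggested; names are illustrative, not wasmi's.
#[derive(Debug, PartialEq)]
enum Op {
    I32Load(u32, u32), // (align, offset)
    I64Load(u32, u32),
    Other,
}

#[derive(Debug, PartialEq)]
enum Emitted {
    I32Load(u32),
    I64Load(u32),
}

macro_rules! emit_loads {
    ($op:expr, $sink:expr, { $($variant:ident),* }) => {
        match $op {
            // One identical arm per listed variant: drop the alignment
            // hint, emit the offset immediate.
            $( Op::$variant(_align, offset) => $sink.push(Emitted::$variant(offset)), )*
            _ => {}
        }
    };
}

fn translate(op: Op, sink: &mut Vec<Emitted>) {
    // A single invocation covers every load variant identically, so
    // adding a variant can't silently diverge from the others.
    emit_loads!(op, sink, { I32Load, I64Load });
}
```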

Collaborator (author):

Yeah, you're right, but I would rather not touch this code! All instructions that happened to be placed after TeeLocal were mechanically and carefully edited with the help of the multiline-cursor feature :D

Contributor:

If this code ever needs to be changed, we'll have to convert it to a macro. Will tests fail if we mess it up?

Collaborator (author):

If this code ever needs to be changed we have to convert it to be a macro.

Agree.

Will tests fail if we mess it up?

I guess so

acc = acc
.checked_add(locals_group.count())
.ok_or_else(||
Error(String::from("Locals range no in 32-bit range"))
Contributor:

Typo: should say "not"

Collaborator (author):

Thanks!

}

impl ValueStack {
fn with_limit(limit: usize) -> ValueStack {
let mut buf = Vec::new();
buf.resize(limit, RuntimeValue::I32(0));
Contributor:

If we unify the value types as in #99 we can make this use calloc

Collaborator (author):

Yeah, that's the plan! (And because of this I've asked you about zero-ing FPs the other day : ) )

Contributor:

Great! I don't know whether it will work with a union but we can do Vec::<u64>::new() and then convert that to a Vec<RuntimeValue>. That'll definitely use calloc.
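The calloc idea can be sketched as follows, under the assumption (from #99) of a single 64-bit untagged value representation; the ValueCell union and zeroed_stack function here are illustrative, and the raw-parts conversion is only sound because the layouts match exactly:

```rust
use std::mem;

// Hypothetical untagged 64-bit value cell, standing in for the unified
// representation discussed in #99; names are illustrative, not wasmi's.
#[derive(Clone, Copy)]
union ValueCell {
    raw: u64,
    #[allow(dead_code)]
    float: f64,
}

fn zeroed_stack(limit: usize) -> Vec<ValueCell> {
    // vec![0u64; n] lowers to a zeroed allocation (calloc under the hood),
    // so the OS can hand back lazily-mapped zero pages.
    let zeros: Vec<u64> = vec![0; limit];
    // Reinterpreting the buffer is only sound because ValueCell has
    // exactly the size and alignment of u64:
    assert_eq!(mem::size_of::<ValueCell>(), mem::size_of::<u64>());
    assert_eq!(mem::align_of::<ValueCell>(), mem::align_of::<u64>());
    let mut zeros = mem::ManuallyDrop::new(zeros);
    unsafe {
        Vec::from_raw_parts(
            zeros.as_mut_ptr() as *mut ValueCell,
            zeros.len(),
            zeros.capacity(),
        )
    }
}
```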

@@ -7,3 +7,6 @@ authors = ["Sergey Pepyakin <s.pepyakin@gmail.com>"]
wasmi = { path = ".." }
assert_matches = "1.2"
wabt = "0.3"

[profile.bench]
debug = true
@NikVolf (Contributor) commented Jun 25, 2018:

why?

Contributor:

For cachegrind, although this probably shouldn't be committed.

Collaborator (author):

Yeah, or Instruments in my case : )

src/runner.rs Outdated
.pop_as();
Ok(InstructionOutcome::Branch(table.get(index as usize).cloned().unwrap_or(default) as usize))

let dst =
Contributor:

strange indentation below

Contributor:

Maybe we can use rustfmt from here on? It would have to start with a new PR

@pepyakin (Collaborator, author) commented Jun 25, 2018:

I'd like to!

src/runner.rs Outdated
let table_func_idx: u32 = context
.value_stack_mut()
let table_func_idx: u32 = self
.value_stack
Contributor:

new line for this is overkill?

src/runner.rs Outdated
.value_stack_mut()
fn run_drop(&mut self) -> Result<InstructionOutcome, TrapKind> {
let _ = self
.value_stack
Contributor:

ditto too many lines

.expect("Due to validation stack should contain pair of values");
let (left, right) = self
.value_stack
.pop_pair_as::<T>();
@NikVolf (Contributor) commented Jun 25, 2018:

is ::<T> needed here?

@pepyakin (Collaborator, author) commented Jun 25, 2018:

In this particular place, yes, it is needed. Rustc has problems inferring the types here.
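A stand-in illustration of the inference problem (hypothetical signatures, not wasmi's): when the popped pair flows straight into another generic function, no call site pins down T, so the turbofish is required.

```rust
// A stand-in for pop_pair_as, to show why the turbofish is needed.
fn pop_pair_as<T: From<u8>>(stack: &mut Vec<u8>) -> (T, T) {
    let right = T::from(stack.pop().unwrap());
    let left = T::from(stack.pop().unwrap());
    (left, right)
}

// A generic consumer: many T satisfy its bound, so it pins nothing down.
fn widen<T: Into<u64>>(v: T) -> u64 {
    v.into()
}
```

Calling `let (l, r) = pop_pair_as(&mut stack);` and passing the results only to `widen` leaves T ambiguous; `pop_pair_as::<u16>(&mut stack)` resolves it.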

let drop_keep = drop_keep_return(
&context.locals,
&context.value_stack,
&context.frame_stack
Contributor:

missing trailing comma ;)

@pepyakin pepyakin merged commit f6657ba into master Jul 4, 2018
@pepyakin pepyakin deleted the flat-stack branch July 4, 2018 07:08
@eira-fransham (Contributor):

🎉
