
Techniques for writing interpreter loops #3

Open
yorickpeterse opened this issue Apr 13, 2018 · 9 comments

@yorickpeterse
For interpreters, one of the core components is the interpreter/dispatch loop. In its most basic form this is essentially a loop combined with a (giant) match, but depending on the language there are other techniques available. I think it's worth collecting the various approaches available in Rust, their benefits, drawbacks, etc. Over time we should probably extend this into a document outlining them more clearly, including examples and whatnot.

@pliniker wrote a bit about this in https://pliniker.github.io/post/dispatchers/, and I discussed things a bit in https://www.reddit.com/r/rust/comments/66h3t2/alternatives_to_dispatching_using_loop_match/.

Available Techniques

  • loop + match
    • 👍 Easy to implement
    • 👍 Portable (since it's just regular Rust code)
    • 👍 Fairly easy to understand, assuming the match isn't 5 000 lines long
    • 👎 Performance may vary depending on the CPU branch predictor, how Rust/LLVM compiles the code, etc
    • 👎 In my personal experience the performance can also be influenced by the number of arms, though this has never been consistent: sometimes adding an arm would slow things down, then things would speed up again once I added one or two more.
  • loop plus some form of inline assembly
    • 👍 Probably as fast as you can get things
    • 👎 Not portable as you need to write different ASM snippets for different platforms
    • 👎 Very fragile due to how LLVM operates with ASM
    • 👎 Harder to maintain as you now need to know both Rust and ASM
  • computed goto
    • 👍 About as fast as the ASM approach
    • 👍 Assuming the language supports it you can write the code once for all platforms
    • 👎 Rust doesn't support this since it doesn't play well with the borrow checker, and probably won't for a very long time (if ever)
    • 👎 Can be a bit hard to wrap your head around
  • Recursive function calls using TCO
    • 👍 Good performance, though I don't remember how well it performs compared to the other techniques
    • 👍 Doesn't require messing with ASM or goto
    • 👎 Only enabled when building in release mode
    • 👎 Only available on x86_64 if I remember correctly
    • 👎 It requires that every instruction handler is a separate function, which can make it very hard to control the outer loop (e.g. a break inside a handler won't work). This means that to control the loop you'd still need some kind of match

There are probably more, but these are the ones I can think of at the moment.
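To make the first technique concrete, here is a minimal sketch of a loop + match dispatcher for a made-up toy stack machine. The opcode set, the `run` function, and all names are invented for illustration; a real VM would have far more opcodes and state:

```rust
// Minimal loop + match dispatcher for a hypothetical stack machine.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

fn run(program: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut index = 0;

    loop {
        // The match is the dispatcher: one arm per opcode.
        match program[index] {
            Op::Push(value) => stack.push(value),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            // The match also decides when to leave the loop.
            Op::Halt => return stack.pop().unwrap_or(0),
        }
        index += 1;
    }
}

fn main() {
    let program = [Op::Push(2), Op::Push(3), Op::Add, Op::Halt];
    println!("{}", run(&program)); // prints 5
}
```

Everything here is safe Rust, which is the main selling point of this approach; the performance caveats above apply to how LLVM ends up compiling the match.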

@yorickpeterse changed the title from "Interpreter loops" to "Techniques for writing interpreter loops" on Apr 13, 2018
@pliniker
Member

In this irlo forum discussion, stocklund explained how clang and LLVM are able to optimize a switch into a computed-goto structure.

It seems possible that rustc could be modified to do the same.

I'm not sure about some of the reasoning in the post as to why this shouldn't be an issue anymore because modern CPUs are good enough at branch prediction. My testing did show that low-power CPUs' branch predictors are still a way behind, and that computed gotos continue to measurably outperform other methods there.

@yorickpeterse
Author

yorickpeterse commented Apr 13, 2018

@pliniker I imagine this being less of an issue on newer architectures, but I wouldn't be surprised if the problem persists on mobile/embedded devices (e.g. those running ARM). Even if it's no longer a problem I think it's worth documenting, including data to show how things behave.

@pliniker
Member

My dispatch experiments did show that ARM processors do much better with computed gotos.

I'm also curious to see how the Spectre-mitigation microcode updates will affect branch prediction - I had read suggestions that there would be a decrease in the performance of high-end Intel CPUs' branch predictors, but I have no data on that yet.

@cipriancraciun

cipriancraciun commented Apr 21, 2018

A small comment regarding the available techniques: if one uses Rust as a language backend, it means that its "safety" is one of the main features that the developer had in mind; therefore any solution that "breaks" this "safety", especially for such a core component of a language VM, practically nulls the Rust advantage. As such the only variants that make sense are those that leverage "native Rust", i.e. loop+match and recursive function calls with TCO.

BTW: is there an example of what such a TCO-based loop would look like in Rust? And what is the performance advantage over loop+match?

(Update: @pliniker 's article https://pliniker.github.io/post/dispatchers/ , also linked in the initial comment, provides such an example.)
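For reference, here is one hedged sketch of how tail-call-style dispatch can be approximated in safe Rust today. Since Rust does not guarantee tail-call optimization, the sketch uses a small trampoline: each handler returns the next handler instead of calling it directly. The bytecode, opcodes, and all names are invented for illustration:

```rust
// Trampolined tail-call-style dispatch for a hypothetical one-register VM.
struct Vm {
    program: Vec<u8>, // made-up bytecode: 0 = halt, 1 = increment
    pc: usize,
    acc: i64,
}

// A newtype is needed so a handler's return type can mention handlers
// (a bare recursive `fn` pointer type alias would be infinitely sized).
struct Handler(fn(&mut Vm) -> Option<Handler>);

fn dispatch(vm: &mut Vm) -> Option<Handler> {
    let op = vm.program[vm.pc];
    vm.pc += 1;
    match op {
        1 => Some(Handler(op_incr)),
        _ => None, // halt (and anything unknown)
    }
}

fn op_incr(vm: &mut Vm) -> Option<Handler> {
    vm.acc += 1;
    // With guaranteed TCO this would be a direct tail call to `dispatch`.
    Some(Handler(dispatch))
}

fn run(vm: &mut Vm) {
    // Trampoline: keep invoking whatever handler is returned next.
    let mut next = Some(Handler(dispatch));
    while let Some(Handler(f)) = next {
        next = f(vm);
    }
}

fn main() {
    let mut vm = Vm { program: vec![1, 1, 1, 0], pc: 0, acc: 0 };
    run(&mut vm);
    println!("{}", vm.acc); // prints 3
}
```

With guaranteed TCO, each handler could call the next one directly in tail position and the trampoline loop would disappear; the sketch above only mimics the control flow, not the performance characteristics.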


Regarding loop+match, I have a design and performance dilemma: say one has 100 variants (i.e. a large number); should one:

  • use a single enum, and thus a match with all 100 arms,
  • or should one split the enum into 10 smaller enums with, say, 10 variants each, thus having a nested match?

Without knowing the internals of how Rust compiles match down to assembly, I would assume that the first approach (i.e. one single match) might be more efficient if the compiler uses a jump table ("computed goto") or a binary search; if instead the compiler -- which I doubt -- checks each arm in sequence, then the second approach might be more efficient.

For example in my interpreter I decided to go with the second approach due to readability.

However, after doing that refactoring (i.e. transforming the 100-variant enum into 10 smaller enums) I did notice a performance hit in my interpreter.
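For concreteness, a minimal sketch of the grouped layout being compared here (all group and variant names are invented). Note that the nested form performs two match dispatches per instruction instead of one, which is one plausible explanation for the observed slowdown:

```rust
// Hypothetical instruction groups for the "nested match" layout.
enum ArithOp { Add, Sub }
enum StackOp { Push(i64), Pop }

// The outer enum selects a group; each group carries its own sub-enum.
enum GroupedOp {
    Arith(ArithOp),
    Stack(StackOp),
}

fn describe(op: &GroupedOp) -> &'static str {
    // First dispatch: which group?
    match op {
        GroupedOp::Arith(inner) => match inner {
            // Second dispatch: which operation within the group?
            ArithOp::Add => "add",
            ArithOp::Sub => "sub",
        },
        GroupedOp::Stack(inner) => match inner {
            StackOp::Push(_) => "push",
            StackOp::Pop => "pop",
        },
    }
}

fn main() {
    println!("{}", describe(&GroupedOp::Stack(StackOp::Pop))); // prints "pop"
}
```

A flat 100-variant enum lets the compiler emit a single jump table over all opcodes, whereas the grouped form may compile to two chained dispatches; the actual codegen depends on the enum layouts and LLVM's choices, so measuring (as done above) is the only reliable check.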


As related to writing interpreter loops, there are two additional topics:

  • how to efficiently implement "continuations" (as per Scheme definition), or "coroutines" (as per Python definition);
  • how to efficiently implement "green-threads" (as per Go "goroutines"), or at a much larger scale as per Erlang's "processes";

@madmalik

A small comment regarding the available techniques: if one uses Rust as a language backend, it means that its "safety" is one of the main features that the developer had in mind; therefore any solution that "breaks" this "safety", especially for such a core component of a language VM, practically nulls the Rust advantage.

The interpreter loop may be a core component, but it's also quite self-contained and easily testable. IMO it'd be a prime example of where a bit of unsafe code is justified.

@cipriancraciun

The interpreter loop may be a core component, but it's also quite self-contained and easily testable. IMO it'd be a prime example of where a bit of unsafe code is justified.

I don't debate that the "loop" might be easily tested and formally checked, however you are taking one thing for granted: that the input (i.e. the "bytecode") is "correct" (i.e. by the "loop" specifications). Thus you are ignoring one major component which might not be as simple to test and check: the "compiler" and then the "optimizer" (or any other component that creates the "bytecode").

Think about this: I bet someone has formally proved that, say, a subset of a given real assembly language is "correct", and one might therefore expect that this property translates to the "machine code" being run. In practice this doesn't happen, as shown by the excessive number of buffer over- and under-runs...

@madmalik

I don't debate that the "loop" might be easily tested and formally checked, however you are taking one thing for granted: that the input (i.e. the "bytecode") is "correct" (i.e. by the "loop" specifications).

Maybe I'm overlooking something, but AFAIK this is not really a problem. There is a lookup from bytecode to function pointer. If this dispatch is correct and the functions that are jumped to are correctly implemented, we're back to safe operations. The only "attack" that would be possible from the bytecode is sending an invalid opcode for which there is no function pointer in the table. But that is not hard or expensive to verify beforehand. It also might be possible (if we use only a byte range for the opcode) to just fill up the table with halt commands.
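A hedged sketch of the table-based approach described here: a 256-entry function-pointer table, pre-filled with a halt handler so that every possible opcode byte maps to some handler and no verification pass is needed. The opcodes and handler names are invented for illustration:

```rust
// Byte-indexed dispatch table where undefined opcodes fall through to halt.
struct Vm {
    acc: i64,
    running: bool,
}

type OpFn = fn(&mut Vm);

fn op_halt(vm: &mut Vm) { vm.running = false; }
fn op_incr(vm: &mut Vm) { vm.acc += 1; }
fn op_double(vm: &mut Vm) { vm.acc *= 2; }

fn build_table() -> [OpFn; 256] {
    // Every slot defaults to halt; only known opcodes are overwritten.
    let mut table: [OpFn; 256] = [op_halt as OpFn; 256];
    table[1] = op_incr;
    table[2] = op_double;
    table
}

fn run(bytecode: &[u8]) -> i64 {
    let table = build_table();
    let mut vm = Vm { acc: 0, running: true };
    let mut pc = 0;

    while vm.running && pc < bytecode.len() {
        let op = bytecode[pc];
        pc += 1;
        // A u8 index into a 256-slot table can never be out of bounds,
        // so no bounds check on the opcode itself is required.
        table[op as usize](&mut vm);
    }
    vm.acc
}

fn main() {
    // incr, incr, double, then an unknown opcode (0xFF) halts safely.
    println!("{}", run(&[1, 1, 2, 0xFF, 1])); // prints 4
}
```

Because the table covers the entire opcode byte range, malformed bytecode degrades to an early halt rather than undefined behaviour, which matches the "fill up the table with halt commands" idea above.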


I've downloaded @pliniker's dispatch crate and played with it a little bit. The simple match loop variant is... special. With the right amount of useless match arms it really performs quite well. But I only have a Haswell laptop to test on.
I'd like to test direct threading, but that would be incompatible with the bytecode format in that crate.

@yorickpeterse
Author

If it helps:

Inko's VM loop uses some unsafe code, but this is mostly to work around some weird borrow-checking limitations. This is because the VM borrows a few references, and I couldn't get Rust to stop complaining about some of those being replaced in the inner loop (e.g. when jumping between functions). This isn't as pretty as I would like it to be, but it's really not that bad either. At its very core, however, the logic boils down to this:

let instructions = X;
let mut index = 0;

loop {
    let instruction = instructions[index];

    match instruction.instruction_type {
        // ...
    }

    index += 1;
}

The match takes care of breaking out of the loop whenever necessary.

In the case of Inko I went with the assumption that the bytecode is valid by the time we run it, and there's some basic validation going on when parsing bytecode (at least enough that we can't end up with out-of-bounds array accesses). I have been meaning to add a more in-depth bytecode verifier at some point, but its priority is super low.

A slightly safer alternative would be:

let instructions = X;
let mut index = 0;

while index < instructions.len() {
    let instruction = instructions[index];

    match instruction.instruction_type {
        // ...
    }

    index += 1;
}

However, I found that in Inko this added some unnecessary overhead, so I went with the first snippet.

When documenting interpreter techniques I think it's better if we err on the side of safety / less performance, but clearly state how one might be able to further optimise the code (possibly at the cost of safety). This way if somebody blindly copy-pastes the code they don't expose themselves to all kinds of nasty issues.

@madmalik

I have a question regarding JIT compilation: my (probably wrong and/or simplistic) understanding is that one builds bits of assembly associated with each opcode and then unrolls the interpreter loop by copying those bits of assembly for each opcode into a buffer.

If that is the case, would it be possible to unify that with the "loop + inline assembly" approach? If we have to hardcode each operation in asm anyway, why not reuse it for the interpreter loop?
