
Create bytecode compiler and evaluation #147

Closed
wants to merge 10 commits

Conversation

@gmorenz commented May 1, 2020

Hi,

This is a first pass at creating a bytecode implementation for the language. I noticed you (offhand) discussed making a bytecode in comments on another issue, so I'm hoping that you will be interested in merging this once it's more complete. Definitely feel free to let me know if you'd rather parts of this were designed or implemented differently, or if you don't feel that it's an appropriate feature in the first place.

This isn't done, but it's at the point where I'm pretty sure I will finish it, and it's beginning to be at the point where I don't want to be accidentally duplicating work. Right now cargo test --features="no_object no_index" -- --skip eval --skip test_timestamp --skip test_type_of passes and it hasn't been optimized at all.

My eventual goal for this is to use it in a game I'm writing to let players script their units' actions. I'm hoping that the bytecode will improve performance when repeatedly executing the same script, give me a deterministic measure of work done by counting instructions executed, and let me implement features where user scripts can act as a generator that yields values and pauses/resumes.

So far none of those goals have been achieved, however I think there is a pretty clear path forwards from this implementation to all of them. I'm also not sure if the parts needed to achieve the second two goals will really belong in the core language, but they could live behind feature flags, or I could maintain a fork, since the amount of code needed for them should be pretty small.

TODO - In the immediate future

  • Set up testing properly, ideally so that tests test both executors and compare the output.
  • Implement objects, index, in, and type_of.
  • Maybe implement eval. How attached are you to this function? Another option might be to have any function that calls eval execute as an AST instead.
  • Keep track of Positions. Possibly with a second source map array in the bytecode struct.
  • Set up benchmarks

TODO - But it could probably wait until after this is merged

  • Broaden the API to match the one for ASTs, e.g. by calling it with an existing Scope, calling arbitrary functions, merging multiple pieces of bytecode, and so on.
  • Improve efficiency of bytecode representation
    • Make each individual variant smaller (e.g. by interning Strings)
    • Store bytecode as an array of bytes, so that small instructions can take up less space than large ones. Or maybe make clever use of unions to do the same.
  • Improve efficiency of executor (I'm maintaining multiple stacks, since it was the easiest way to implement things, not the most performant. I haven't done any work on optimizing the function).
  • Improve efficiency of foreign function calls (This is probably going to be hard. Currently we search for functions at run-time because it depends on the types of the arguments. One option might be to add a type inference phase, then whenever we managed to determine the types we could directly specify the function in the bytecode).

TODO - Potential future projects

  • Maybe fuzz the two executors for differences
  • JIT bytecode to asm?

TODO - But not in this pull request and maybe not in this repository

  • Some form of Yield(Dynamic, State) return option that allows for storing the execution state and restarting it later.
  • Instruction counting in the executor, possibly weighting instructions with some form of approximate execution cost.
  • The option to pause the executor after so many instructions are executed (similar to Yield but without a value).

@schungx (Collaborator) commented May 1, 2020

This is great. You're probably referring to #100 where I think compiling Rhai down to a bytecode format with an interpreter may make it run faster for the really demanding applications (e.g. games).

Can I suggest you base your changes on https://github.com/schungx/rhai/tree/bytecodes, where I do most of the ongoing work, to avoid merge conflicts later. The bytecodes branch is newly created so you can base your PR on it.

I'll try to keep it up to date by occasionally merging in changes from master.

@schungx (Collaborator) commented May 1, 2020

Do you plan on retaining the AST as a valid execution format and generate bytecodes from the AST, or compile directly to bytecodes?

// TODO: Removed if !state.always_search
Expr::Variable(_, Some(index), _) =>
self.instructions.push(GetVariable{ index }),
// TODO: Is this ever used without eval?
@schungx (Collaborator) May 1, 2020

You're right. always_search is only ever set to true when there is an eval call and the number of variables in the scope changes. But this is necessary if we ever use eval. Or do you plan to disallow eval calls when compiling to bytecodes? That can work also.

@gmorenz (Author)

Sort of discussed below: disallowing eval, or backing out to the AST version of the code, seems like the simplest solution. (That means outright disallowing it in functions that want the ability to yield, or that call functions with that ability, but as originally mentioned I'm not sure that's a particularly mainstream feature.)

@schungx (Collaborator) commented May 1, 2020

Maybe implement eval. How attached are you to this function? Another option might be to have any function that calls eval execute as an AST instead.

Not attached to it at all. Just put it in because it was dead simple in the beginning. Now I realize why people who work on JS optimizers curse the eval.

Since an eval script is only ever run once each time after it is generated, a good strategy is to defer to AST-walking. You still need an efficient strategy to deal with the additional variables in the scope throwing off your offsets though...

Set up benchmarks

I'm also wondering how much faster bytecodes will be compared to AST-walking. Do you have any preliminary benchmarks?

Improve efficiency of foreign function calls

Yes, I've been thinking about it also. Profiling runs with Rhai show that function-name/arg-types hashing consistently shows up on top of IR counts. Therefore, dispatching to the correct function appears to be a hot-path bottleneck, although so far I haven't thought of a way to do it faster than hashing everything.

@gmorenz (Author) commented May 1, 2020

Can I suggest you base your changes on https://github.com/schungx/rhai/tree/bytecodes

Sure, will do.

Have you considered making schungx/rhai the main repo for the language?

Do you plan on retaining the AST as a valid execution format and generate bytecodes from the AST, or compile directly to bytecodes?

For my purposes I'll be compiling directly to bytecode. The ability to implement the aforementioned features is necessary, and I don't see how I could do that easily with an AST, not to mention that performance should be better.

If you're asking from a language perspective, I assume AST execution will stick around. At least for one-off scripts without loops I think it should be faster. It's certainly too early to think about killing it.

Since an eval script is only ever run once each time after it is generated, a good strategy is to defer to AST-walking. You still need an efficient strategy to deal with the additional variables in the scope throwing off your offsets though...

The idea is to just leave the entire function containing the eval in the ast-executor code. Calling an ast function from bytecode should be no harder than calling any other foreign function. That should drop all the extra variables before returning to bytecode.

Do you have any preliminary benchmarks?

I just quickly set up a version of the fibonacci benchmark in response and out of curiosity.

test bench_iterations_fibonacci          ... bench: 286,836,385 ns/iter (+/- 2,348,435)
test bench_iterations_fibonacci_bytecode ... bench: 265,362,002 ns/iter (+/- 801,004)

Considering that no effort has gone into optimizing it, and that there are some easy looking wins, I'm pretty happy with it being faster at all.

Yes, I've been thinking about it also. Profiling runs with Rhai show that function-name/arg-types hashing consistently shows up on top of IR counts. Therefore, dispatching to the correct function appears to be a hot-path bottleneck, although so far I haven't thought of a way to do it faster than hashing everything.

There are probably very few overrides of any given function on average, and small linear searches are typically faster than hashes, so something like this might work well to speed up the search (note that the hot path doesn't touch a String at all):

struct FunctionsTable {
    function_names: HashMap<String, usize>,
    functions: Vec<Vec<FnType>>,
}

fn register_fn(table: &mut FunctionsTable, fn_name: String, f: FnType) {
    if let Some(&idx) = table.function_names.get(&fn_name) {
        table.functions[idx].push(f);
    } else {
        table.function_names.insert(fn_name, table.functions.len());
        table.functions.push(vec![f]);
    }
}

enum ExprOrBytecode {
    Call { fn_name_idx: usize, args: Vec<Dynamic> },
}

fn exec_call(table: &FunctionsTable, fn_name_idx: usize, args: Vec<Dynamic>) -> Result<Dynamic, Error> {
    for function in &table.functions[fn_name_idx] {
        if accepts(function, &args) {
            return Ok(function(args));
        }
    }
    Err(Error::FunctionNotFound)
}

@schungx (Collaborator) commented May 1, 2020

Have you considered making schungx/rhai the main repo for the language?

Not really. I am mostly only tinkering based on the fabulous work of people before me.

For my purposes I'll be compiling directly to bytecode. The ability to implement the aforementioned features is necessary, and I don't see how I could do that easily with an AST, not to mention that performance should be better.

Compiling directly to bytecode will probably be a large surgery on the parser. Optimizing machine-generated scripts with spliced-in constants also works more easily on an AST than on bytecodes.

For your purposes, I'd assume that you would be compiling your users' scripts into bytecodes, and then persisting the bytecode stream so it can be reloaded later and no need to re-parse, so it might be more useful to keep the AST around.

That should drop all the extra variables before returning to bytecode.

That would then change the semantics of the language because you're allowed to introduce new variables via eval. You can forbid it in bytecodes though, but then probably the engine should raise a runtime error when an eval script runs a let in global scope instead of silently discarding the new variable.

Considering that no effort has gone into optimizing it, and that there are some easy looking wins, I'm pretty happy with it being faster at all.

And it looks like it is more consistent as well judging from the timing variations...

There are probably very few overrides of any given function on average, and small linear searches are typically faster than hashes, so something like this might work well to speed up the search

A search will have to compare all the TypeIds of all the arguments, if not the function name, for each single function entry. By limiting to one single integer type, you get rid of most overloading, so the typical code path compares m x n TypeId values, where m is the number of arguments and n is the average number of overloads, typically 1. Considering that the average function has around 2 arguments, this may be faster than a hash.

In the pathological case of many overloads (e.g. arithmetic functions supporting all integer types), it will be slower.
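
For illustration, that m x n comparison might be sketched like this (stand-in types; Overload and find_overload are invented names, not Rhai's actual API):

```rust
use std::any::TypeId;

// One registered overload of a function: the parameter types it accepts.
struct Overload {
    params: Vec<TypeId>,
}

// Linear scan over the overloads of a single, pre-resolved function name.
// The cost is roughly (overload count) x (argument count) TypeId
// comparisons, with no hashing and no string work on the hot path.
fn find_overload<'a>(overloads: &'a [Overload], arg_types: &[TypeId]) -> Option<&'a Overload> {
    overloads.iter().find(|o| {
        o.params.len() == arg_types.len()
            && o.params.iter().zip(arg_types).all(|(p, a)| p == a)
    })
}

fn main() {
    let overloads = vec![
        Overload { params: vec![TypeId::of::<i64>(), TypeId::of::<i64>()] },
        Overload { params: vec![TypeId::of::<f64>(), TypeId::of::<f64>()] },
    ];
    assert!(find_overload(&overloads, &[TypeId::of::<f64>(), TypeId::of::<f64>()]).is_some());
    assert!(find_overload(&overloads, &[TypeId::of::<bool>()]).is_none());
}
```

With the typical one or two overloads, the scan touches only a handful of TypeId values; with many overloads the scan degrades linearly, which is the pathological case described above.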

(note that the hot path doesn't touch a String at all):

Yes it does. You need to map a function name to its index. It can be pre-mapped if using only one AST, but when using multiple AST's plus user-loaded packages, you cannot predict the index of a function based on its name alone. In other words, you'll need to do a table lookup based on the name string. Of course, compiling to bytecodes assumes you won't add new functions later on, so you can freely do this optimization. Unless you want to give your users freedom to add/mix functions in bytecodes form, then you'll need to do the name lookups.

One way to handle it is to cache the name->index mapping and only incur this cost once, until the functions library is changed. Then the cache is invalidated.
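
One possible shape for such a cache, as a sketch with hypothetical names rather than Rhai's actual structures: tag the table with a generation counter, bump it whenever the functions library changes, and trust a cached index only while its generation matches.

```rust
use std::collections::HashMap;

// Hypothetical function table: name -> index, plus a generation counter
// bumped on every change to the functions library.
struct FnTable {
    by_name: HashMap<String, usize>,
    generation: u64,
}

// A call site remembers the index it resolved last time, tagged with the
// generation it was resolved against.
struct CallSite {
    name: String,
    cached: Option<(u64, usize)>,
}

impl CallSite {
    fn resolve(&mut self, table: &FnTable) -> Option<usize> {
        if let Some((g, idx)) = self.cached {
            if g == table.generation {
                return Some(idx); // cache hit: no string hashing at all
            }
        }
        // First call, or the library changed: pay the string lookup once.
        let idx = *table.by_name.get(&self.name)?;
        self.cached = Some((table.generation, idx));
        Some(idx)
    }
}

fn main() {
    let mut table = FnTable { by_name: HashMap::new(), generation: 0 };
    table.by_name.insert("foo".to_string(), 3);
    let mut site = CallSite { name: "foo".to_string(), cached: None };
    assert_eq!(site.resolve(&table), Some(3)); // miss, then cached
    assert_eq!(site.resolve(&table), Some(3)); // hit
    table.generation += 1; // library changed: cache invalidated
    table.by_name.insert("foo".to_string(), 7);
    assert_eq!(site.resolve(&table), Some(7)); // re-resolved once
}
```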

@gmorenz (Author) commented May 1, 2020

Compiling directly to bytecode will probably be a large surgery on the parser. Optimizing machine-generated scripts with spliced-in constants also works more easily on an AST than on bytecodes.

For your purposes, I'd assume that you would be compiling your users' scripts into bytecodes, and then persisting the bytecode stream so it can be reloaded later and no need to re-parse, so it might be more useful to keep the AST around.

Ah, sorry, we're talking about different things.

I don't particularly have a problem with compiling "source -> AST -> bytecode", maybe someday I will want to optimize that but for now it seems fine. I just don't expect to ever actually execute the AST.

I'll probably persist the original source (for display and deduplication reasons) and persist/cache the bytecode... As long as I keep a version of the bytecode around I don't think I'll need the AST to reload it, but if it ends up being useful it's not an issue to store it as well.

That would then change the semantics of the language because you're allowed to introduce new variables via eval. You can forbid it in bytecodes though, but then probably the engine should raise a runtime error when an eval script runs a let in global scope instead of silently discarding the new variable.

Am I mistaken in thinking that eval inside a function can't introduce variables outside that function? Top level eval would mean backing out the top level "function" to the AST version just like everywhere else, but functions that don't themselves contain eval should still be fine?

In the pathological case of multiple overloading (e.g. arithmetic functions supporting all integer types), then it will be slower.

True, and these are really common operators...

Thinking more, it might be possible to have our cake and eat it too: decide at AST/bytecode build time whether to go the hash route or the lookup route depending on the number of overloads. But since this wouldn't optimize the most common type of call (arithmetic), it might not be worth the effort.
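
A sketch of what that build-time choice could look like; the Dispatch enum, the threshold, and the stand-in signature/function types are all invented for illustration:

```rust
use std::any::TypeId;
use std::collections::HashMap;

type Sig = Vec<TypeId>;              // argument types of one overload
type NativeFn = fn(i64, i64) -> i64; // stand-in for a registered function

// Chosen once per function name when the bytecode is built.
enum Dispatch {
    Linear(Vec<(Sig, NativeFn)>),   // few overloads: small linear scan
    Hashed(HashMap<Sig, NativeFn>), // many overloads (e.g. arithmetic): hash
}

const LINEAR_LIMIT: usize = 4; // invented cut-off

fn build_dispatch(overloads: Vec<(Sig, NativeFn)>) -> Dispatch {
    if overloads.len() <= LINEAR_LIMIT {
        Dispatch::Linear(overloads)
    } else {
        Dispatch::Hashed(overloads.into_iter().collect())
    }
}

fn lookup(d: &Dispatch, sig: &Sig) -> Option<NativeFn> {
    match d {
        Dispatch::Linear(v) => v.iter().find(|(s, _)| s == sig).map(|(_, f)| *f),
        Dispatch::Hashed(m) => m.get(sig).copied(),
    }
}

fn main() {
    let add: NativeFn = |a, b| a + b;
    let sig: Sig = vec![TypeId::of::<i64>(), TypeId::of::<i64>()];
    let d = build_dispatch(vec![(sig.clone(), add)]);
    assert!(matches!(&d, Dispatch::Linear(_))); // one overload: linear route
    assert_eq!(lookup(&d, &sig).unwrap()(2, 3), 5);
}
```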

@schungx (Collaborator) commented May 1, 2020

Am I mistaken in thinking that eval inside a function can't introduce variables outside that function? Top level eval would mean backing out the top level "function" to the AST version just like everywhere else, but functions that don't themselves contain eval should still be fine?

That's the point. eval is essentially equivalent to running the script code in-place, including defining new variables. Except that functions cannot be defined in eval.

At top level, eval will introduce new variables. At block level (such as a {} block or inside a function), any new variables introduced will be local. That's why always_search is reset to false at the end of a block.

@gmorenz (Author) commented May 1, 2020

(note that the hot path doesn't touch a String at all):

Yes it does. You need to map a function name to its index. It can be pre-mapped if using only one AST, but when using multiple AST's plus user-loaded packages, you cannot predict the index of a function based on its name alone. In other words, you'll need to do a table lookup based on the name string. Of course, compiling to bytecodes assumes you won't add new functions later on, so you can freely do this optimization. Unless you want to give your users freedom to add/mix functions in bytecodes form, then you'll need to do the name lookups.

One way to handle it is to cache the name->index mapping and only incur this cost once, until the functions library is changed. Then the cache is invalidated.

Hmm, yes, it does require some sort of linking step that is going to take time O(code_size). That's probably a reasonable thing to do in an AST optimization pass though. The bytecode version already has a pass like this called resolve_fn_calls used to optimize bytecode->bytecode calls.
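
That O(code_size) linking step could be a single pass over the instruction stream, rewriting name-based calls into index-based ones so each name lookup is paid once at link time (instruction and table names below are invented; resolve_fn_calls in the PR presumably does something along these lines for bytecode-to-bytecode calls):

```rust
use std::collections::HashMap;

// Hypothetical instructions before and after linking.
#[derive(Debug, PartialEq)]
enum Instr {
    CallByName { name: String, n_args: usize },
    CallByIndex { idx: usize, n_args: usize },
    Push(i64), // other instructions pass through unchanged
}

// One walk over the code; names not present in the table are left
// as-is, to be resolved at run time.
fn link(code: Vec<Instr>, table: &HashMap<String, usize>) -> Vec<Instr> {
    code.into_iter()
        .map(|instr| match instr {
            Instr::CallByName { ref name, n_args } if table.contains_key(name) => {
                Instr::CallByIndex { idx: table[name.as_str()], n_args }
            }
            other => other,
        })
        .collect()
}

fn main() {
    let mut table = HashMap::new();
    table.insert("print".to_string(), 0);
    let code = vec![
        Instr::Push(42),
        Instr::CallByName { name: "print".to_string(), n_args: 1 },
    ];
    let linked = link(code, &table);
    assert_eq!(linked[1], Instr::CallByIndex { idx: 0, n_args: 1 });
}
```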

@schungx (Collaborator) commented May 1, 2020

test bench_iterations_fibonacci ... bench: 286,836,385 ns/iter (+/- 2,348,435) test bench_iterations_fibonacci_bytecode ... bench: 265,362,002 ns/iter (+/- 801,004)

And Fibonacci is probably not a good benchmark for your use case, as it is almost exclusively recursion. The workload will be on stack management, where you probably can't save much.

You'll want to use speed_test or primes (if you have arrays) to benchmark calculation and loop-heavy scenarios.

@schungx (Collaborator) commented May 6, 2020

@gmorenz Unfortunately I added modules support so you're going to have to deal with functions and variables qualified by modules (and possibly a way to load modules that are pre-compiled to bytecodes).

I can help out with some of the infrastructure and plumbing after you get bytecodes to work without modules -- meaning it compiles under no_module -- that would probably be enough.

@gmorenz (Author) commented May 6, 2020

@gmorenz Unfortunately I added modules support so you're going to have to deal with functions and variables qualified by modules (and possibly a way to load modules that are pre-compiled to bytecodes).

Yeah, that's no problem. Some form of modules was on my wish list for this language anyway, and I definitely already have ideas about how to implement modules and linking of modules.

Current work here is still on creating a clean and efficient index chain model... it turns out that the semantics don't fit a stack machine particularly cleanly. It's close to done, but I got sidetracked on the get_mut idea.

@schungx (Collaborator) commented May 6, 2020

Current work here is still on creating a clean and efficient index chain model... it turns out that the semantics don't fit a stack machine particularly cleanly. It's close to done, but I got sidetracked on the get_mut idea.

Yes, array indexing is hairy, especially when the language doesn't have pointers... otherwise it is a piece of cake, but then you'd probably have to throw in a GC...

The problem with implementing it with a stack machine is that you always need to keep a back-pointer to the original object somewhere down the stack so that, once you get all the index values on the stack, you can restart from the beginning and work your way up to the top of the stack.

So some form of frame pointer to keep a stack frame for the indexing operation...
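
A toy version of that idea, with nested i64 arrays standing in for Dynamic (all names invented): evaluate every index expression first, then re-walk the chain from the root object in one pass. In a real VM the indices would sit on the stack between the frame pointer and the stack top; here they are passed as a slice.

```rust
// Toy value type standing in for Rhai's Dynamic.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Int(i64),
    Array(Vec<Value>),
}

// Assignment through an index chain like a[i][j] = v: all index values
// have already been evaluated; walk from the root down the chain once,
// then overwrite the final slot.
fn set_index_chain(root: &mut Value, indices: &[usize], new_val: Value) -> Result<(), String> {
    let mut cur = root;
    for &i in indices {
        match cur {
            Value::Array(items) => {
                cur = items.get_mut(i).ok_or("index out of bounds")?;
            }
            Value::Int(_) => return Err("cannot index into an integer".to_string()),
        }
    }
    *cur = new_val;
    Ok(())
}

fn main() {
    // a = [[1, 2], [3, 4]]; a[1][0] = 9;
    let mut a = Value::Array(vec![
        Value::Array(vec![Value::Int(1), Value::Int(2)]),
        Value::Array(vec![Value::Int(3), Value::Int(4)]),
    ]);
    set_index_chain(&mut a, &[1, 0], Value::Int(9)).unwrap();
    if let Value::Array(rows) = &a {
        if let Value::Array(row1) = &rows[1] {
            assert_eq!(row1[0], Value::Int(9));
        }
    }
}
```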

@gmorenz (Author) commented May 11, 2020

As mentioned in schungx/11 I do not anticipate making time to finish this work anytime soon. Closing this for now. Feel free to re-use any part of this in any way if you so desire.

Thanks for the warm welcome @schungx.

@gmorenz gmorenz closed this May 11, 2020
@schungx schungx mentioned this pull request Jun 4, 2020