Bunny-JIT

This is a tiny optimising SSA-based JIT backend, currently targeting x64 and arm64. The Makefile expects either Unix environment (and libtool) or Windows with clang (and llvm-lib) in path, but there is no real build magic (just compile everything in src/ really).

The arm64 backend is very new, so it might still have some issue (well, more issues than the x64 backend), but at least on macOS/M1 it seems to be passing the tests.

The project status right now is that I'm reasonably happy with the current set of optimizations (for the time being, at least) and the focus right now is mostly on simplifying existing functionality and extending test-coverage to expose more issues. Feel free to experiment, but please expect to still find some serious bugs.

Please don't use it for production yet. Please contribute more tests first.

Warning: API changes loads/stores take off16 and stores now take value first.

Features:

small and simple (by virtue of elegant design), yet tries to avoid being naive
portable¹ C++11 without dependencies (other than STL)
uses low-level portable instruction set that models common architectures
supports integers, single- and double-floats (probably SIMD at some point)
end-to-end SSA, with consistency checking² and simple interface to generate valid SSA
performs roughly³ DCE, GCSE/LICM/PRE, CF/CP (SCCP?) and register allocation (as of now)
assembles to native x64 binary code with simple module system that supports hot-patching
uses std::vector to manage memory, keeps asan⁴ happy, tries to be cache efficient

¹Obviously loading code on the fly is not entirely portable (we are fully W^X compliant), but we support generic mmap/mprotect (eg. Linux, macOS, etc) and Windows (now fixed and tested as well).

²This doesn't guarantee correctness, but problems with optimizations are usually much more likely to assert than to silently produce incorrect code.

³ This is a bit hand-wavy, because traditional optimizations are formulated in terms of variables, yet we optimize purely on SSA values, but this is roughly what we get. There are some limitations with PRE/SCCP in the name of simplicity, but we should get most of the high-value situations; see below for details.

⁴ I used to run this regularly under valgrind too, but it no longer works on macOS. I'll revisit the topic (testing on Linux) once this gets closer to production ready.

I suggest looking at the tests (eg. test_fib.cpp or test_sieve.cpp) for examples of how to use the programming API, which is the primary focus of this library. Bunny-JIT comes with some sort of simple front-end language, but this is intended more for testing (and I guess example) than as a serious programming language and there is a pretty good chance that it'll eventually be retired/replaced. The test-driver bin/bjit parses this simple language from stdin and compiles it into native code, which is written to out.bin for disassembly purposes (eg. with ./dump-bin.sh if you have gobjdump in path).

You can certainly run it too, but you'll have to copy it to executable memory (which must also be readable, but not writable; we place constants directly after code in order to avoid having to relocate a separate .rodata).

There is now bjit::Module that can do this for you, but I haven't got around to updating the front-end yet.

Why Bunny?

Bunnies are small and cute and will take over the world in the near future.

Bunny-JIT is intended for situations where it is desirable to create some native code on the fly (eg. for performance reasons), but including something like LLVM would be a total overkill. Why add a gigabyte of dependencies, if you can get most of the high-value stuff with less than 10k lines of sparse C++?

Our goal is not necessarily to produce the best possible code (you should probably use LLVM for that), but rather to produce something that is good enough to make dynamic code-generation worth the trouble, while simple to use (both in terms of API and in terms of including into a project; I'm even considering an STB-style "single-file" version once the whole thing can be considered somewhat stable).

It is primarily intended for generating run-time specialized code, especially where this can give significant performance advantages, so we're willing to spend some time on optimization, but because we're still aiming at interactive uses we try not to go crazy heuristics. Instead we aim to find a set of general optimizations that are reasonably efficient and always lead to a fixed-point. This rules out optimizations such as loop-unrolling where profitability is not clear.

It can also be used as a backend for custom languages. It might not be great for dynamic languages that rely heavily on memory optimizations (we don't optimize loads and stores, except simple CSE on loads), but even then it might serve as a decent prototype backend.

License?

I should paste this into every file, but for the time being:

/****************************************************************************\
* Bunny-JIT is (c) Copyright pihlaja@signaldust.com 2021-2024                *
*----------------------------------------------------------------------------*
* You can use and/or redistribute this for whatever purpose, free of charge, *
* provided that the above copyright notice and this permission notice appear *
* in all copies of the software or it's associated documentation.            *
*                                                                            *
* THIS SOFTWARE IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY. USE AT YOUR OWN    *
* RISK. IN NO EVENT SHALL THE COPYRIGHT HOLDER BE HELD LIABLE FOR ANYTHING.  *
\****************************************************************************/

The Makefile is not covered by this license and can be considered public domain.

How to build?

On Unix-like system or Windows with clang (and libtool on Unix) installed, simply run make (or make -j). Ideally that's all.

In case you need local overrides, the Makefile includes local.make if it exists.

We override clang for CC but if BJIT_USE_CC is defined, then this is used; use this if you don't want to use clang for some weird reason. If BJIT_BUILDDIR and/or BJIT_BINDIR are defined, these will be used instead of build/ and bin/. If it fails to build because of -Werror then please report it because I'd really like this to compile clean with all the popular compilers.

By default the Makefile is set to compile with -fno-exceptions (and BJIT_ASSERT will simply call assert) but if you enable exceptions then BJIT_ASSERT will throw bjit::internal_error on failures instead.

There is also make test that will build everything and then run run-tests.sh to do some basic sanity checking (very limited for now). This won't quite work on Windows though (it should build tests, but we don't have a run-tests.bat yet).

Any source files in src/ are compiled and collected to build/bjit.a and each tests/<name>.cpp is compiled into a separate bin/<name> for testing purposes. Any source files in front/ are compiled into bin/bjit which you can use after building to test the library with the interactive front-end parser, but note that this doesn't actually run the code (yet), it just compiles it into out.bin.

There should be no need to touch the Makefile at all just to add additional files unless you also need a new build-target (other than for a new test in tests/ as those get their build-targets automatically).

Should you somehow run into issues with automatic dependencies, type make clean to start fresh. Should not happen, but just in case. I hereby place the Makefile into public domain so please adapt it for your own projects if you don't know how to write one properly.

Contributing?

I would recommend opening an issue before working on anything too significant, because this will greatly increase the chance that your work is still useful by the time you're ready to send a pull-request.

The reason for this is that the project is still young enough that I might make some drastic changes on a regular basis. I also recommend opening an issue if you want to start building something on top of it, so we can discuss stability of interfaces (which for the time being are subject to change without notice).

The omena branch is my laptop's remote and generally contains the latest and greatest and most broken code.

Code of conduct?

Github thinks having "code of conduct" is important, so here's one:

This project is a safe-space for minorities of any kind.

This project is not a safe-space for SJWs. Go away.

Instructions?

The first step is to include src/bjit.h.

The second step is to create a bjit::Proc which takes a stack allocation size and a string representing arguments (i for integer, f for single floats, d for double). The allocated block will always be the SSA value Value{0} (in practice this is the stack pointer) and the arguments will be placed in env[0..n] (left to right, at most 4 for now). More on env below. Pass 0 and "" if you don't care about allocations or arguments. Note that we don't have an LLVM-like mem2reg optimization, so you should not place variables in memory unless you need to index an array or pass a pointer somewhere. Put them into env instead.

To generate instructions, you can then call instruction methods on Proc. The last instruction of every block must be either a jump (conditional or unconditional), a return instruction or a tail-call (DCE can cleanup dead tails though, so a front-end can safely generate extra jumps or returns when in doubt). Any instructions are placed into the last emitted label or the entry-block if no labels have been emitted yet.

To generate new labels call Proc::newLabel() and to start emitting code to a previous created label call Proc:emitLabel() (see env below). Any labels must always be created (but not necessarily emitted) before emitting jumps to them. Any given label can only be emitted once (ie. calling emitLabel() makes the contents of the previous block final; this is done as sanity checking only, so let me know if you can demonstrate a use-case where this should be relaxed).

Most instructions take their parameters as SSA values. The exceptions are lci/lcf/lcd which take immediate constants and jump-labels which must be labels returned by Proc::newLabel(). For instructions with output values, the methods return the new SSA values and other instructions return void.

Next you can either call Proc::compile() to obtain native code, or you can create a bjit::Module and pass the Proc to Module::compile() which will return an index. Either of these can take an "optimization level" (default: 1) that can be 0 for DCE-only, 1 for "safe" or 2 to allow "unsafe" optimizations (eg. "fast-math" floating-point, allow exceptions from integer div-by-zero to occur earlier/later than expected, etc; essentially slighly more stuff is treated as "undefined behaviour").

Multiple procedures can be compiled into the same module. See below on how procedures can call each other or external functions. Module::load() will load the module into executable memory and Module::unload() will unload it. A module can be loaded and unloaded multiple times. Additional procedures can be compiled at any time, but they will not be loaded until the Module is either reloaded or module is hot-patched with Module::patch() (see bjit.h for details).

When a Module is loaded call Module::getPointer<T>() with an index returned by Module::compile() to get a pointer to your function (with T being the type of the function; we don't check this, we just typecast for you). See one of the tests (eg. tests/test_add_ii.cpp) for an example.

Instructions expect their parameter types to be correct. Passing floating-point values to instructions that expect integer values or vice versa will result in undefined behaviour (ie. invalid code or BJIT_ASSERT; the latter will either call assert or throw bjit::internal_error depending on whether compiled with exceptions). Exceptions should not leak memory, but if you catch an exception, then you should assume that the throwing Proc (or Module) is no longer in consistent state.

The compiler should never fail with valid data unless the IR size limit is exceeded. The limit is 64k IR instructions; we throw bjit::too_many_ops if compiled with exceptions, otherwise we assert as usual; note that instructions removed by DCE and renames/reloads generated during register allocation count towards this limit, so it can also happen during compile(), but if you're seriously worried about this limit, then Bunny-JIT is probably not the best choice for your use-case.

As we do not expect to fail when used correctly (ie. if you're writing a front-end language, your front-end should do the error-checking), we do not provide other error reporting beyond generic BJIT_ASSERT (if you trap these in debugger though, the failure conditions should typically give a reasonable hint as to what went wrong). In practice, at this time it probably will BJIT_ASSERT on valid code in some cases; I'm working on test coverage, but please report a bug if you come across such cases.

The type system is very primitive though and mostly exists for the purpose of tracking which registers we can use to store values. In particular, anything stored in general purpose registers is called _ptr (or simply integers).

Env?

Proc has a public std::vector member env which stores the "environment". When a new label is create with Proc::newLabel() the number and types of incoming arguments to the block are fixed to those contained in env and when jumps are emitted, we check that the contents of env are compatible (same number of values of same types). When Proc::emitLabel() is called to generate code for the label, we replace the contents of env with fresh phi-values. So even though we only handle SSA values, elements of env behave essentially like regular variables (eg. "assignments" can simply store a new SSA value into env). Note that you can adjust the size of env as you please as long as constraints match for jump-sites, but keep in mind that emitLabel() will resize env back to what it was at the time of newLabel().

I should emphasize that you don't necessarily need to put everything into env if your front-end already knows that it doesn't need any phis because it's never assigned to (locally or globally). If you understand SSA, then you can certainly be more intelligent and only keep stuff in env when phis are potentially required, but this is not a requirement: the very first pass of DCE will get rid of any excess phis just fine.

To clarify env is only used by the compiler:

at entry to new procedure it receives the arguments
when newLabel() is called: the types are copied
when emitLabel() is called: phis are generated for the stored types
when a jump to a label is emitted: env is added to target phi alternatives
when a call is emitted: the arguments are taken from env

Instruction set?

Instructions starting i are for integers, u are unsigned variants when there is a distinction, f is single-precision float and d is double-precision float. Note that floating-point comparisons return integers, even though they expect _f32 or _f64 parameters.

The compiler currently exposes the following instructions:

lci i64, lcf f32 and lcd f64 specify constants, jmp label is unconditional jump and jz a then else will branch to then if a is zero or else otherwise and jnz a then else will branch to then if a is non-zero and else otherwise, iret a returns from the function with integer value, fret a with single-precision float value and dret a returns with a double-precision float value.

ieq a b and ine a b compare two integers for equality or inequality and produce 0 or 1.

ilt a b, ile a b, ige a b and igt a b compare signed integers for less, less-or-equal, greater-or-equal and greater respectively

ult a b, ule a b, uge a b and ugt a b perform unsigned comparisons

feq a b, fne a b, flt a b, fle a b, fge a b and fgt a b are single-float version of the same (still produce integer 0 or 1).

deq a b, dne a b, dlt a b, dle a b, dge a b and dgt a b are double-float version of the same (still produce integer 0 or 1).

iadd a b, isub a b and imul a b perform (signed or unsigned) integer addition, subtraction and multiplication, while ineg a negates an integer

idiv a b and imod a b perform signed division and modulo, note that divide-by-zero is undefined behaviour for levelOpt=2 (ie. hardware exceptions might not happen where expected)

udiv a b and umod a b perform unsigned division and modulo, note that divide-by-zero is undefined behaviour for levelOpt=2 (ie. hardware exceptions might not happen where expected)

inot a, iand a b, ior a b and ixor a b perform bitwise logical operations

ishr a b and ushr a b are signed and unsigned right-shift while left-shift (signed or unsigned) is ishl a b and we currently specify that the number of bits to shift is modulo the bitsize of integers (eg. both x64 and Aarch64 do this; on other architectures it could be implemented efficiently by masking, but note that this is subject to change and we might rule it as undefined behaviour at least for the "unsafe" levelOpt=2)

fadd a b, fsub a b, fmul a b, fdiv a b and fneg a are single-float versions of arithmetic operations

dadd a b, dsub a b, dmul a b, ddiv a b and dneg a are double-float versions of arithmetic operations

fabs and dabs compute single- and double-precision absolute value

cf2i a converts singles to integers while ci2f a converts integers to singles

cf2d a converts singles to doubles while cd2f a converts doubles to singles

cd2i a converts doubles to integers while ci2d a converts integers to doubles

bcf2i a and bci2f a bit-cast (ie. reinterpret) float-to-int and int-to-float without conversion

bcd2i a and bci2d a bit-cast (ie. reinterpret) double-to-int and int-to-double without conversion

i8 a, i16 a and i32 a can be used to sign-extend the low 8/16/32 bits

u8 a, u16 a and u32 a can be used to zero-extend the low 8/16/32 bits

Loads follow the form lXX ptr off16 where ptr is integer SSA value and off16 is an unsigned immediate offset (eg. for field offsets). The variants defined are li8/16/32/64, lu8/16/32 and lf32/64. The integer i variants sign-extend while the u variants zero-extend.

Stores follow the form sXX value ptr off16 where ptr and off16 are like loads while value is the SSA value to store. Variants are like loads, but without the unsigned versions.

While the compiler doesn't move loads across stores (or other side-effects) it can move them out of loops. If you need to prevent this (eg. for multi-threading reasons) then you can use fence to force a memory barrier. On x64 this is a pure compiler fence (no code generated), but on Arm64 we do issue a full memory barrier.

Internally we have additional instructions that the fold-engine will use (in the future the exact set might vary between platforms, so we rely on fold), but they should be fairly obvious when seen in debug, eg. jugeIis a conditional jump on uge comparison with the second operand converted to an imm32 field.

Calling functions?

Function call support is still somewhat limited as we only support up to 4 parameters and the support is currently not particularly robust as it relies on register allocator not accidentally overwriting parameters. This "should not happen"(tm), but there is no real sanity-checking done for this, yet.

Why up to 4? Because this is as many as we can pass in registers on Windows and although this could be relaxed on Unix, I want to keep the feature-set identical across platforms. If your code works on one platform, it shouldn't suddenly fail on another. The limitation will probably always stay for tail-calls, but we'll hopefully will be supporting more parameters for regular calls in the future.

There are essentially two types of calls: near-calls and indirect calls.

Indirect ("pointer") calls can be done with icallp, fcallp and dcallp which take a pointer to a function (as SSA value) and the number of arguments. The arguments are taken from the end of env (ie. push_back() them left-to-right; calls don't pop the arguments, you'll have to clean them up yourself). icallp returns an integer value, fcallp returns single-float and dcallp returns a double-float value.

There is also tcallp which performs a tail-call, effectively doubling as a return with the return value of the call. As it does not return to the procedure, it can (and generally should) be the last thing in a given block. Tail-calls always clean up the stack before the call, so infinite chains of tail-calls are fine, but if you allocated a stack block then this is already invalid at the time of the call (ie. don't pass pointers to stack with tail-calls, it won't work).

There is also "near" versions icalln, fcalln, dcalln and tcalln which can be used to call other procedures in the same module. These take the (compile-time) index of the procedure as their first parameter. Module::compile() is guaranteed to give procedures sequential indexes starting from 0 so the target procedure need not be compiled first as long as the index is valid by the time Module::load() is called.

In order to call functions outside the module without having to use lci to load a constant address every time, Module also supports stubs. To compile a stub, call Module::compileStub() with the memory address of the target procedure. Stubs count as procedures in terms of indexes and can be called with near calls.

Patching calls

You can also change far call stub targets later by calling Module::patchStub() with the index of a previously compiled stub and a new target address. The stub target will be updated (in executable code) next time either Module::patch() or Module::load() is called. Note that "bad things will happen"(tm) if you try to patch a procedure that is not actually a stub (we don't check this in any way).

Near-calls can also be patched, either globally with Module::patchCalls(oldT,newT) or locally in one procedure with Module::patchCallsIn(inProc,oldT,newT) where oldT and newT are the old and new targets respectively. There is currently no support for patching a single call though (you'll need to compile some stubs for that).

Note that Module::patch() does not apply any patches if it can't also load all newly compiled code, but they remain pending and will be applied on module reload.

What it does?

The test_sieve.cpp contains a C++ variation of the classic sieve algorithm, with "flags" array and size as parameters:

int sieve(char * flags, int size)
{
    int count = 0;

    for (int i = 0; i < size; ++i) flags[i] = true;
    for (int i = 2; i < size; ++i)
    { 
        if (flags[i])
        {
            int prime = i + 1; 
            int k = i + prime; 

            while (k < size)
            { 
                flags[k] = false; 
                k += prime;
            }
            
            ++count;
        }
    }

    return count;
}

It also contains the same thing manually translated to the Bunny-JIT API:

    bjit::Module module;
    {
        bjit::Proc  pr(0, "ii");

        int _flags = 0;
        int _size = 1;

        // i = 0
        int _i      = pr.env.size(); pr.env.push_back(pr.lci(0));
        // count = 0
        int _count  = pr.env.size(); pr.env.push_back(pr.lci(0));

        auto ls0 = pr.newLabel();
        auto lb0 = pr.newLabel();
        auto le0 = pr.newLabel();

        pr.jmp(ls0);
        
        pr.emitLabel(ls0);
        // while i < size
        pr.jz(pr.ilt(pr.env[_i], pr.env[_size]), le0, lb0);
        pr.emitLabel(lb0);
        
            // *(flags + i) = 1
            pr.si8(pr.iadd(pr.env[_flags], pr.env[_i]), 0, pr.lci(1));
            // ++i
            pr.env[_i] = pr.iadd(pr.env[_i], pr.lci(1));
            pr.jmp(ls0);
            
        pr.emitLabel(le0);

        // i = 2
        pr.env[_i] = pr.lci(2);
        auto ls1 = pr.newLabel();
        auto lb1 = pr.newLabel();
        auto le1 = pr.newLabel();

        pr.jmp(ls1);
        pr.emitLabel(ls1);
        
        // while i < size
        pr.jz(pr.ilt(pr.env[_i], pr.env[_size]), le1, lb1);
        pr.emitLabel(lb1);
            
            auto bt = pr.newLabel();
            auto be = pr.newLabel();
    
            // if flags[i] != 0
            pr.jnz(pr.li8(pr.iadd(pr.env[_flags], pr.env[_i]), 0), bt, be);
            pr.emitLabel(bt);
        
                // prime = i + 1
                int _prime  = pr.env.size();
                pr.env.push_back(pr.iadd(pr.env[_i], pr.lci(1)));
        
                // k = i + prim
                int _k = pr.env.size();
                pr.env.push_back(pr.iadd(pr.env[_i], pr.env[_prime]));
        
                auto ls2 = pr.newLabel();
                auto lb2 = pr.newLabel();
                auto le2 = pr.newLabel();
        
                pr.jmp(ls2);
                pr.emitLabel(ls2);
        
                // while k < size
                pr.jnz(pr.ilt(pr.env[_k], pr.env[_size]), lb2, le2);
                pr.emitLabel(lb2);
        
                    // flags[k] = false;
                    pr.si8(pr.iadd(pr.env[_flags], pr.env[_k]), 0, pr.lci(0));

                    // k = k + prime
                    pr.env[_k] = pr.iadd(pr.env[_k], pr.env[_prime]);
            
                    pr.jmp(ls2);
                    
                pr.emitLabel(le2);
        
                pr.env.pop_back();  // k
                pr.env.pop_back();  // prime
        
                pr.env[_count] = pr.iadd(pr.env[_count], pr.lci(1));
        
                pr.jmp(be);
            
            pr.emitLabel(be);
            
            // ++i
            pr.env[_i] = pr.iadd(pr.env[_i], pr.lci(1));
    
            pr.jmp(ls1);

        pr.emitLabel(le1);

        pr.iret(pr.env[_count]);

        pr.debug();

        module.compile(pr);
    }

Bunny-JIT currently compiles it into this native code:

   0:	48 83 ec 08                                     	sub    rsp,0x8
   4:	33 c0                                           	xor    eax,eax
   6:	b9 02 00 00 00                                  	mov    ecx,0x2
   b:	48 83 fe 00                                     	cmp    rsi,0x0
   f:	0f 8e 19 00 00 00                               	jle    0x2e
  15:	ba 01 00 00 00                                  	mov    edx,0x1
  1a:	4c 8b c0                                        	mov    r8,rax
  1d:	41 88 14 38                                     	mov    BYTE PTR [r8+rdi*1],dl
  21:	49 83 c0 01                                     	add    r8,0x1
  25:	4c 3b c6                                        	cmp    r8,rsi
  28:	0f 8c ef ff ff ff                               	jl     0x1d
  2e:	48 83 fe 02                                     	cmp    rsi,0x2
  32:	0f 8e 4e 00 00 00                               	jle    0x86
  38:	48 8b d0                                        	mov    rdx,rax
  3b:	4c 8b c1                                        	mov    r8,rcx
  3e:	48 83 c1 01                                     	add    rcx,0x1
  42:	4d 0f be 0c 38                                  	movsx  r9,BYTE PTR [r8+rdi*1]
  47:	4d 85 c9                                        	test   r9,r9
  4a:	0f 85 11 00 00 00                               	jne    0x61
  50:	48 3b ce                                        	cmp    rcx,rsi
  53:	0f 8c e2 ff ff ff                               	jl     0x3b
  59:	48 8b c2                                        	mov    rax,rdx
  5c:	e9 25 00 00 00                                  	jmp    0x86
  61:	48 83 c2 01                                     	add    rdx,0x1
  65:	4c 03 c1                                        	add    r8,rcx
  68:	4c 3b c6                                        	cmp    r8,rsi
  6b:	0f 8d df ff ff ff                               	jge    0x50
  71:	41 88 04 38                                     	mov    BYTE PTR [r8+rdi*1],al
  75:	4c 03 c1                                        	add    r8,rcx
  78:	4c 3b c6                                        	cmp    r8,rsi
  7b:	0f 8c f0 ff ff ff                               	jl     0x71
  81:	e9 ca ff ff ff                                  	jmp    0x50
  86:	48 83 c4 08                                     	add    rsp,0x8
  8a:	c3                                              	ret    
  8b:	90                                              	nop
  8c:	90                                              	nop
  8d:	90                                              	nop
  8e:	90                                              	nop
  8f:	90                                              	nop

Compared with clang -Ofast this silly test typically runs around 10-20% slower. You should expect Bunny-JIT to do significantly worse on code that relies heavily on optimizing memory access or code that can be effectively vectorized, but the basic idea is that it can typically output something reasonable.

SSA?

The backend keeps the code in SSA form from the beginning to the end. We rely on env to automatically add phis for all cross-block variables initially. While there is no need to add temporaries to the environment, always adding phis for any actual local variables still creates a lot more phis than necessary. We choose to let DCE clean this up, by simplifying those phis with only one real source. This cleanup is fundamental to the design as it acts as a dataflow analysis pass and is repeated several times through the optimization process.

We keep the SSA structure all the way. The code is valid SSA even after register allocation. Registers are not SSA values, but each value has one register. When we need to rename registers or reload values from stack, we define new values. We handle phis by simply making sure that the phi-functions are no-ops: all jump-sites place the correct values in either the same registers or stack-slots, depending on what the phi expects. Two-way jumps always generate shuffle blocks, which are then jump-threaded if the edge is not actually critical or the shuffle is empty.

The register allocator itself runs locally, using "furthest next use" to choose which values to throw out of the register file, but passes information from one block to the next to reduce pointless shuffling. We don't ever explicitly spill, rather we flag the source operation with a spill-flag when we emit a reload. This is always valid in SSA, because we have no variables, only values.

The assembler will then generate stores after any operations marked for spill, because we resolve SCCs to actual slots only after register allocation is done.

SCC?

To choose stack locations, we compute what I like to call "stack congruence classes" (SCCs) to find which values can and/or should be placed into the same slot. Essentially if two values are live at the same time, then they must have different SCCs. On the other hand, if a value is argument to a phi, then we would like to place it into the same class to avoid having to move spilled values from one stack slot to another. For other values, we would like to try and find a (nearly) minimal set of SCCs that can be used to hold them, in order to keep the stack frames as small as possible.

As pointed out earlier, sometimes we have cycles. Eg. if two values are swapped in a loop every iteration, then we can't allocate the same SCC to these without potentially forcing a swap in memory. We solve SCCs (but not slots) before register allocation, so at this point we don't know which variables live in memory. To make sure the register allocator doesn't need worry about cycles (in memory; it can deal with cycles in actual registers) the SCC computation adds additional renames to temporary SCCs. This way the register allocator can mark the rename (rather than the original value) for spill if necessary, otherwise the renames typically compile to nops. The register allocator itself never adds any additional SCCs: at that point we already have enough to allow every shuffle to be trivial.

Only after register allocation is done, do we actually allocate stack slots. At this point, we find all the SCCs with at least one value actually spilled and allocate slots for these (and only these), but since we know (by definition) that no two variables in the same SCC are live at the same time, at this point we don't need to do anything else. We just rewrite all the SCCs to the slots allocated (or "don't need" for classes without spills) and pass the total number of slots to the assembler.

The beauty of this design is that it completely decouples the concerns of stack layout and register allocation: the latter can pretend that every value has a stack location, that any value can be thrown away and reloaded at any time (as long as it's then marked for "spill" at the time the reload is done) yet we can trivially collapse the layout afterwards. The assembler will then generate the actual stores after the operations that need them (valid because of SSA).

Optimizations

At the beginning of this page I said above I find traditional compiler optimizations a bit confusing to relate to, because they talk about things like versions of variables and safety. In SSA, there is exactly one version of a value and values are referentially transparent: as long as the value is defined in some dominator block the value is safe. But here's what we do:

The DCE pass (opt_dce) serves a couple of purposes: first, it finds all the actual live-blocks by doing a depth-first search from the entry. Any block that isn't reached will be ignored. It also counts the number of uses for each value and throw away anything (without side-effects) with a use-count of zero (and then all values that now have a use-count of zero and so on). It also performs simple jump threading and replaces phi uses with actual values if all alternatives are either the same or simple loop-back to the phi itself. This essentially gives us a simple form of data-flow analysis.

Invariants: DCE builds the list of live blocks, so it must run as the very first thing even for non-optimizing builds. It does not rely on any other analysis. DCE is used a cleanup pass, so all other passes (including register allocation) must leave the IR in a state where DCE is safe (ie. we even do a final DCE pass after register allocation, just before native assembly; if this breaks the code, it's almost certainly a bug in opt_ra).

DCE (and other optimizations) also calls rebuild_cfg which rebuilds "come from" information and cleans up phi alternatives where the incoming edge no longer exists. Invariants: rebuild_cfg assumes live contains all live blocks and that all jumps can be found by looking at the last ops of each block (ie. all dead tails have been eliminated).

The opt_fold pass only does simple constant-folding/strength-reduction and algebraic simplications and only relies on live containing all live blocks. Operations simplified to nop are left in-place for DCE to clean up.

Because opt_fold also simplifies conditional jumps into non-conditional jumps and because opt_dce simplies unnecessary phi loops this gives us much of SCCP although we can't handle the cases where a branch must be taken for the branch condition to ever choose that branch. I'm not entirely convinced this would really be worth the trouble.

Invariants: Folding does not rewrite CFG, dominators are used for reassoc only and it only relies on the SSA invariant that definitions always dominate uses.

The opt_cse pass does two things: it first tries to hoist operations up the dominator chain to the earliest block where all inputs are available, unless the operation is marked with flags.no_opt (eg. we already sunk the op where it's needed; FIXME: we might want to make hoisting a bit more intelligent and allow it to break critical edges directly).

Then opt_cse tries to globally combine all pairs of identical ops and place the combined op at the closest common dominator (by SSA property there always exists a CCD where all the inputs are defined), but only if it can reach the CCD by following a two-way dominator/post-dominator chain (ie. don't cross branches that don't also merge, but diamonds are fine; this is ignored for the CCD itself, because if we come from two different branches, then both branches compute the same value and combining just results in smaller code).

CSE also tries to match against the alternatives of a phi operands. If we find matches this way, we insert similar ops on other paths (assuming they don't have one already) and then collect results into a new phi which gives us PRE for the simple cases, although we don't currently attempt to recursively search through multiple phis (maybe some day; this can get a bit expensive though).

Invariants: CSE moves/renames operations, can insert additional phi and break critical edges, but rebuilds dominators if it changes the CFG.

Because opt_cse needs dominator information (for both hosting and actual CSE) to avoid moving operations in the wrong places, it calls rebuild_dom which finds the immediate dominators .idom and immediate post-dominators .pdom for each block and then builds a per-block sorted list of all dominators .dom to allow for easier dominance checks and CCD searches.

Invariants: rebuild_dom calls rebuild_cfg so it rebuilds both.

The opt_sink pass does the opposite of hoisting and tries to move ops down branches where they are actually needed. For this it needs live-in information from rebuild_livein and because it can break critical edges if necessary it will invalidate the CFG and dominator information.

The opt_jump pass optimizes jumps. Currently it only optimizes jumps back to dominators (loop back edges; opt_jump_be) by making a copy of the block and adding new phis for all live-in values (globally) that originated from the (supposedly) loop header. This effectively results in loop inversion and allows opt_sink to create pre-headers by breaking critical edges. Because opt_jump rewrites CFG, it needs both livein and dominators and it will invalidate both. It rebuilds CFG, but not dominators.

The opt_scc pass computes SCCs and the opt_ra pass performs register allocation. There are only done once at the end of the compilation and they are always done, even for non-optimized builds. After RA the code is ready to be assembled. Note that code is still valid SSA.

Currently Bunny-JIT does very limited memory optimization: we allow DCE, CSE and hoisting on loads, but assume any side-effect (of any kind) can alias. It probably never try to do sophisticated analyse aliasing, because this is such a huge can of worms, but I'm looking to maybe optimize some of the simple cases (eg. identical loads with no stores between them) in the future.

Bunny-JIT is currently very inefficient compiler for high-level object-oriented languages that rely heavily on following the same pointer-chains over and over again and it will probably always be very inefficient for languages that rely heavily on mutating objects in memory. I don't really care, it's not my focus.

Name		Name	Last commit message	Last commit date
Latest commit History 882 Commits
front		front
src		src
tests		tests
win		win
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
dump-bin-arm.sh		dump-bin-arm.sh
dump-bin-x64.sh		dump-bin-x64.sh
run-tests.sh		run-tests.sh

signaldust/bunny-jit

Folders and files

Latest commit

History

Repository files navigation

Bunny-JIT

Why Bunny?

License?

How to build?

Contributing?

Code of conduct?

Instructions?

Env?

Instruction set?

Calling functions?

Patching calls

What it does?

SSA?

SCC?

Optimizations

About

Topics

Resources

Stars

Watchers

Forks

Languages