What's the branch about

  • The branch rtl_mjit_branch is used for development of RTL (register transfer language) VM insns and of an MRI JIT (MJIT for short) for the RTL insns

  • The most recent merge point with the trunk is always the head of the branch rtl_mjit_branch_base

    • The branch rtl_mjit_branch will be merged with the trunk from time to time; correspondingly, the head of the branch rtl_mjit_branch_base will be moved to the new merge point

RTL insns

  • The major goal of introducing RTL insns is to provide an IR suitable for analysis and optimization of Ruby code

    • The current stack-based insns are an inconvenient IR for this goal
  • A secondary goal is faster interpretation of VM insns

    • Stack-based insns create additional memory traffic. Let us consider the Ruby code a = b + c. Stack insns vs RTL insns for this code:
                  getlocal_OP__WC__0 <b index>
                  getlocal_OP__WC__0 <c index>
                  opt_plus
                  setlocal_OP__WC__0 <a index>
                           vs   
                  plus <a index>, <b index>, <c index>
  • Each stack-based insn is shorter, but more of them are usually required for the same Ruby code than RTL insns (see the toy model after this list)
    • We save time on memory traffic and insn dispatching
    • In some cases the number of RTL insns is the same as the number of stack-based insns, as typical Ruby code contains a lot of calls; in such cases executing RTL insns will be slower than executing stack insns
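
The difference can be modeled by the following self-contained C toy (not MRI code; the variable slots and insn names are invented for illustration): the stack machine needs four insns and moves every value through the operand stack, while the RTL-style insn reads and writes the variable slots directly:

  /* Toy model of stack vs RTL execution of a = b + c (not MRI code). */
  #include <stdio.h>

  int vars[3];                            /* slots for a, b, c          */

  static void stack_machine(void) {       /* 4 insns, 6 stack accesses  */
      int stack[8], sp = 0;
      stack[sp++] = vars[1];              /* getlocal b                 */
      stack[sp++] = vars[2];              /* getlocal c                 */
      sp--; stack[sp - 1] += stack[sp];   /* opt_plus                   */
      vars[0] = stack[--sp];              /* setlocal a                 */
  }

  static void rtl_machine(void) {         /* 1 insn, no stack traffic   */
      vars[0] = vars[1] + vars[2];        /* plus <a>, <b>, <c>         */
  }

  int main(void) {
      vars[1] = 2; vars[2] = 5;
      stack_machine();
      printf("stack machine: a = %d\n", vars[0]);
      vars[0] = 0;
      rtl_machine();
      printf("rtl machine:   a = %d\n", vars[0]);
      return 0;
  }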

RTL insn operands

  • What could be an operand:

    • only temporaries
    • temporaries and locals
    • temporaries and locals even from higher levels
    • above + instance variables
    • above + class variables, globals
  • Using only temporaries makes little sense, as it would produce practically the same number of insns, each of them longer

  • The decoding overhead of operands that can have many types would not be compensated by processing a smaller number of insns

  • Complicated operands also complicate optimizations and MJIT

  • Currently we use only temporaries and locals, as preliminary experiments show this is the best approach

  • Practically any RTL insn might turn out to be an ISEQ call. Therefore we need a way to put the result into the destination operand, as a call always puts its result on the stack

    • If an RTL insn is actually an ISEQ call, we change the return PC, so the next insn executed after the call is an insn moving the result from the stack to the insn's destination
    • To decrease memory overhead, the move insn is a part of the original insn
      • For example, in the case of a call in "plus <cont insn>, <call data>, dst, op1, op2", the next executed insn will be "<cont insn> <call data>, dst, op1, op2"
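
Here is a toy, runnable C sketch of this continuation trick (invented opcodes and layout, not the actual insns.def/vm code): when plus turns into a call, the return PC is set to the embedded cont insn, which moves the call result from the stack into the destination operand:

  /* Toy model of "plus <cont insn>, <call data>, dst, op1, op2". */
  #include <stdio.h>

  enum { PLUS, CONT, HALT };

  int regs[8];                  /* locals/temporaries, addressed by index */
  int stack_top;                /* a method call leaves its result here   */

  /* plus <cont>, <call data>, <dst>, <op1>, <op2> ; halt                 */
  int prog[] = { PLUS, CONT, 0, /*dst*/ 2, /*op1*/ 0, /*op2*/ 1, HALT };

  int main(void) {
      int pc = 0, plus_redefined = 1;     /* pretend + is a Ruby method   */
      regs[0] = 3; regs[1] = 4;
      for (;;) {
          switch (prog[pc]) {
          case PLUS:
              if (!plus_redefined) {      /* fast path: no call needed    */
                  regs[prog[pc + 3]] = regs[prog[pc + 4]] + regs[prog[pc + 5]];
                  pc += 6;                /* skip the whole insn          */
              } else {                    /* "ISEQ call": result on stack */
                  stack_top = regs[prog[pc + 4]] + regs[prog[pc + 5]];
                  pc += 1;                /* return PC = embedded cont    */
              }
              break;
          case CONT:                      /* move stack result to dst     */
              regs[prog[pc + 2]] = stack_top;
              pc += 5;                    /* skip call data + 3 operands  */
              break;
          case HALT:
              printf("a = %d\n", regs[2]);    /* prints: a = 7            */
              return 0;
          }
      }
  }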

RTL insn combining and specialization

  • Immediate value specialization (e.g. plusi - addition with an immediate fixnum)

  • Combining frequent insn sequences (e.g. bteq - compare and branch if the operands are equal)
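
As a purely hypothetical illustration (the operand order mirrors the plus example above; the real encodings may differ), i = i + 1 followed by a branch taken when i == n might look like:

                  plusi <i index>, <i index>, 1
                  bteq  <label>, <i index>, <n index>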

Speculative insn generation

  • Some initially generated insns can be transformed into speculative ones during their execution

    • Speculation is based on operand types (e.g. plus can be transformed into an integer plus) and on operand values (e.g. no multi-precision integers)
  • Speculative insns can be transformed into unchanging regular insns if the speculation turns out to be wrong

    • Speculative insns contain code checking that the speculation still holds (see the toy model after this list)
  • Speculation will be even more important for JITed code performance
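
The life cycle can be modeled by the following toy, runnable C sketch (invented opcodes, not MRI code): a generic plus rewrites itself into an integer-speculating variant after seeing integer operands, and when the type guard later fails it rewrites itself into an unchanging generic variant:

  /* Toy model of speculative insn transformation (not MRI code). */
  #include <stdio.h>

  enum { PLUS, SPEC_IPLUS, NOSPEC_PLUS, HALT };
  enum { T_INT, T_STR };

  struct val { int type; long i; };

  static struct val generic_plus(struct val a, struct val b) {
      struct val r = { T_INT, a.i + b.i };    /* full dispatch in real MRI */
      return r;
  }

  int main(void) {
      int prog[] = { PLUS, HALT };
      struct val a = { T_INT, 3 }, b = { T_INT, 4 }, dst = { T_INT, 0 };
      for (int pass = 0; pass < 3; pass++) {
          int pc = 0;
          while (prog[pc] != HALT) {
              switch (prog[pc]) {
              case PLUS:                      /* generic insn: also profiles */
                  dst = generic_plus(a, b);
                  if (a.type == T_INT && b.type == T_INT)
                      prog[pc] = SPEC_IPLUS;  /* speculate on integer types  */
                  pc++;
                  break;
              case SPEC_IPLUS:                /* fast path guarded by checks */
                  if (a.type != T_INT || b.type != T_INT) {
                      prog[pc] = NOSPEC_PLUS; /* wrong speculation: give up  */
                      break;                  /* re-dispatch as generic      */
                  }
                  dst.i = a.i + b.i;
                  pc++;
                  break;
              case NOSPEC_PLUS:               /* unchanging regular insn     */
                  dst = generic_plus(a, b);
                  pc++;
                  break;
              }
          }
          printf("pass %d: result = %ld\n", pass, dst.i);
          if (pass == 1) b.type = T_STR;      /* invalidate the speculation  */
      }
      return 0;
  }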

Two approaches to generate RTL insns:

  • The simplest way is to generate RTL insns from the stack insns
  • A faster approach is to generate them directly from MRI parse tree nodes
  • We use the latter approach, as it makes MRI faster

RTL insns status and future work

  • It mostly works (make check reports no regressions)

  • A lot of work still needs to be done on performance analysis and tuning

  • Many files are changed, but the major changes are in:

    • insns.def: new definitions of the RTL insns
    • rtl_exec.c: most of the code executing RTL insns
    • compile.c: translation of the parse tree into RTL insns. The file is practically rewritten, but I tried to keep the same structure and function names

MRI JIT

A few possible approaches to JIT implementation:

  • A JIT specialized for a specific language (e.g. luajit, rujit)

    • Pro: very fast compilation is achievable
    • Con: a lot of effort to implement decent optimizations and multi-target code generation
  • Using existing VMs with JITs or JIT libraries: Oracle JVM and Graal, IBM OMR, different JavaScript JITs, libjit

    • Pro: saves a lot of effort
    • Cons: a big dependency on code that is hard to control; code less optimized than what the C compilers used to build MRI generate (even with the JVM server compiler); most of these JITs are already used by other Ruby implementations
  • Using the JIT frameworks of existing C compilers: GCC JIT, LLVM JIT engines

    • Pro: saves a lot of effort in generating highly optimized code for multiple targets; no new dependencies, as C compilers are already used for building MRI
    • Cons: unstable interfaces; an LLVM JIT is already used by Rubinius; a lot of effort is needed to prepare the code used by the RTL insns (an environment)
  • Using existing C compilers

    • Pro: a very stable interface; the simplest approach to generating highly optimized code for multiple targets (minimal changes to MRI); small effort to prepare the environment; portability (e.g. GCC or LLVM can be used); no new dependencies; easy debugging of JITed code; the rich optimization sets of industrial C compilers have the potential to generate better code, especially if we manage to provide profile info to them
    • Con: long JIT compilation times, because of time spent on lexical, syntax, and semantic analysis and on optimizations not tailored for speed
  • The above is just a very brief analysis; it led me to use the last approach, as it is the simplest one and adequate for long-running Ruby programs like Ruby on Rails

    • Efforts will be spent on speeding up the compilation

MJIT organization

  _______     _________________
 |header |-->| minimized header|
 |_______|   |_________________|
               |                         MRI building
 --------------|----------------------------------------
               |                         MRI execution
               |                                            
  _____________|_____
 |             |     |
 |          ___V__   |  CC      ____________________
 |         |      |----------->| precompiled header |
 |         |      |  |         |____________________|
 |         |      |  |              |
 |         | MJIT |  |              |
 |         |      |  |              |
 |         |      |  |          ____V___  CC  __________
 |         |______|----------->| C code |--->| .so file |
 |                   |         |________|    |__________|
 |                   |                              |
 |                   |                              |
 | MRI machine code  |<-----------------------------
 |___________________|             loading

  • MJIT is a method JIT (one more reason for the name)

  • An important organizational goal is to minimize the JIT compilation time

  • To simplify the JIT implementation, the environment (the C header needed by the C code generated by MJIT) is just the vm.c file

  • A special Ruby script minimizes the environment

    • It removes about 90% of the declarations

  • MJIT has several threads (workers) doing parallel compilations (see the sketch after this list)

    • One worker prepares the precompiled header from the minimized header

      • It starts at the start of MRI execution
    • One or more workers generate PIC object files for ISEQs

      • They start when the precompiled header is ready
      • They take ISEQs from a priority queue until it is empty
      • They translate the ISEQs into C code using the precompiled header, call the C compiler, and load the PIC code when it is ready
  • MJIT puts an ISEQ into the queue when the ISEQ is called, or right after generating the ISEQ for AOT (Ahead-Of-Time compilation)

  • MJIT can reorder ISEQs in the queue if some ISEQ has been called many times and its compilation has not started yet, or if we need the ISEQ code for AOT

  • MRI reuses the machine code if it already exists for an ISEQ

  • All files are stored in /tmp. On modern Linux, /tmp is an in-memory file system

  • Execution of the machine code can stop and switch to interpretation of the ISEQ if some condition is not satisfied (as the machine code can be speculative) or if some exception is raised

  • Speculative machine code can be canceled, and new mutated machine code can be queued for creation

    • This can happen when an insn speculation was wrong
    • There is a limit on the number of mutations. The default value can be changed by an MJIT option. The last mutation contains code without any speculative insns
  • There are more speculations in JIT code than in the interpreter mode:

    • Global speculation about tracing
    • Global speculation about the absence of basic type operation redefinitions
    • Speculation about the equality of EP (environment pointer) and BP (basic stack pointer)
  • When a global speculation becomes wrong, all currently executing JIT functions are canceled and the corresponding ISEQs continue their execution in interpreter mode

    • This is implemented by checking a special control frame flag after each call that can affect a global speculation
  • In AOT mode, ISEQ JIT code creation is queued right after the ISEQ creation, and the VM always tries to execute the ISEQ's JIT code first. In other words, the VM waits for the creation of the JIT code if it is not yet available

    • Currently AOT probably makes sense mostly for big, long-running programs
  • MJIT options can be given on the command line or through the environment variable RUBYOPT (the latter will probably be removed in the future)
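
The worker organization sketched above can be modeled by the following toy pthreads program (all names invented, not the actual mjit.c): one worker stands in for building the precompiled header, and the compilation workers wait for it before draining the ISEQ queue:

  /* Toy model of the MJIT worker scheme (not MRI code). */
  #include <pthread.h>
  #include <stdio.h>

  #define N_ISEQS 4
  static int queue[N_ISEQS], q_len = 0, pch_ready = 0, done = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t wakeup = PTHREAD_COND_INITIALIZER;

  static void *pch_worker(void *arg) {      /* stands in for: run CC on   */
      (void)arg;                            /* the minimized header       */
      pthread_mutex_lock(&lock);
      pch_ready = 1;
      pthread_cond_broadcast(&wakeup);
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  static void *compile_worker(void *arg) {  /* ISEQ -> C -> CC -> load    */
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (!pch_ready || (q_len == 0 && !done))
              pthread_cond_wait(&wakeup, &lock);
          if (q_len == 0) { pthread_mutex_unlock(&lock); return NULL; }
          int iseq = queue[--q_len];        /* highest priority last      */
          pthread_mutex_unlock(&lock);
          printf("compiled ISEQ %d\n", iseq);   /* stands in for: emit C, */
      }                                         /* run CC -fPIC, dlopen   */
  }

  int main(void) {
      pthread_t pch, workers[2];
      pthread_create(&pch, NULL, pch_worker, NULL);
      for (int i = 0; i < 2; i++)
          pthread_create(&workers[i], NULL, compile_worker, NULL);
      pthread_mutex_lock(&lock);            /* ISEQs get "called" and     */
      for (int i = 0; i < N_ISEQS; i++)     /* queued for compilation     */
          queue[q_len++] = i;
      done = 1;                             /* no more ISEQs in this toy  */
      pthread_cond_broadcast(&wakeup);
      pthread_mutex_unlock(&lock);
      for (int i = 0; i < 2; i++)
          pthread_join(workers[i], NULL);
      pthread_join(pch, NULL);
      return 0;
  }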

MJIT status

  • It is at a very early stage of development and is only ready to run a few small and simple Ruby programs
    • make test has no issues

    • The compilation of a small ISEQ takes about 50-70 ms on modern x86-64 CPUs

    • There is no real-time slowdown of Ruby program execution because of MJIT

    • Depending on an MJIT option, GCC or LLVM is used

      • Some benchmarks are faster with GCC, some are faster with LLVM Clang
      • A few factors (mostly the relation between compilation speed and generated code quality) make the outcome hard to predict
      • As GCC and LLVM are ABI compatible, you can compile MRI with GCC and use LLVM for MJIT, or vice versa
    • MJIT is switched on by the -j option

    • Some other useful MJIT options:

      • -j:v helps to see how MJIT works: which ISEQs are compiled and when
      • -j:p prints a final profile of how frequently ISEQs were executed in interpreter mode and in JIT mode
      • -j:a switches MJIT on in AOT mode
      • -j:s saves the precompiled header and all C and object files in /tmp after MRI finishes
      • -j:t=N defines the number of threads used by MJIT to compile ISEQs in parallel (the default N is 1)
      • Use the ruby option --help to see all MJIT options
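
For example (a hypothetical invocation combining the options above), running a script with verbose MJIT output and four compilation threads might look like:

      ruby -j:v -j:t=4 some_script.rb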

MJIT future works

  • A lot of things should be done before MJIT is generally usable. Here are the high-priority ones:

    • Make it work for make check

    • Generation of optimized C code:

      • The ultimate goal is to make inlining possible along Ruby->C->Ruby paths, where Ruby means C code generated by MJIT for user-defined Ruby methods and C means MRI C code implementing some predefined Ruby methods (e.g. times for Integer)
      • More aggressive speculative C code generation, with more opportunities for C compiler optimizations, e.g. speculative constant usage for C compiler constant folding, (conditional) constant propagation, etc.
      • Translation of Ruby temporaries and locals into C locals, and saving them on the MRI thread stack in case of deoptimization
        • Direct calls of C functions generated for ISEQs by MJIT (another form of speculation)
      • Transition from static inline functions to extern inline for GCC and Clang, to permit the compilers themselves to decide about inlining profitability
      • Passing profile info through hot/cold function attributes
        • Maybe passing more detailed info through the C compilers' profile info format in the future
    • Tuning MJIT for faster compilation and less waiting time

    • Implementing On-Stack Replacement (OSR)

      • OSR is the replacement of a still-executing bytecode ISEQ by JIT-generated machine code for that ISEQ
      • It is a low-priority task, as for now it would be useful only for ISEQs with while statements
    • Tailoring MJIT to a server environment

      • Reuse the same ISEQ JIT code across different running MRI instances
      • Use a crypto-hash function to find JIT code for a given pair (PCH hash, ISEQ hash)
    • Addressing MJIT vulnerabilities:

      • Prevent an adversary from changing the C compiler
      • Prevent an adversary from changing MJIT C and object files
      • Prevent an adversary from changing MJIT headers
        • Use a crypto-hash function to check header authenticity

Update: 15th June, 2017

  • MJIT is reliable enough to run some benchmarks to evaluate its potential

  • All measurements were done on an Intel i3-7100 (3.9GHz) with 32GB memory under x86-64 Fedora Core 25

  • For the performance comparison I used the following implementations:

    • v2 - Ruby MRI version 2.0
    • base - the Ruby MRI (2.5 development) version on which the rtl_mjit branch is based
    • rtl - the rtl_mjit branch as of May 31st, without using JIT
    • mjit - as above, but using MJIT with GCC 6.3.1 with -O2
    • mjit-cl - MJIT using LLVM Clang 3.9.1 with -O2
    • omr - Ruby OMR (2016-12-24 revision 57163) in JIT mode (-Xjit)
    • jruby9k - JRuby version 9.1.8.0
    • jruby9k-d - as above, but using -Xcompile.invokedynamic=true
    • graal-22 - Graal Ruby version 0.22
  • I used the following micro-benchmarks (see the MJIT-benchmarks directory):

    • while - while loop
    • nest-while - nested while loops (6 levels)
    • nest-ntimes - nested ntimes loops (6 levels)
    • ivread - reading an instance variable (@var)
    • ivwrite - assignment to an instance variable
    • aread - reading an instance variable through attr_reader
    • awrite - assignment to an instance variable through attr_writer
    • aref - reading an array element
    • aset - assignment to an array element
    • const - reading Const
    • const2 - reading Class::Const
    • call - empty method calls
    • fib - Fibonacci
    • fannk - fannkuch
    • sieve - sieve of Eratosthenes
    • mandelbrot - (non-complex) Mandelbrot, as MRI v2 does not support complex numbers
    • meteor - meteor puzzle
    • nbody - modeling planet orbits
    • norm - spectral norm
    • trees - binary trees
    • pent - pentomino puzzle
    • red-black - Red Black trees
    • bench - rendering
  • MJIT has a very fast startup, which is not true for JRuby and Graal Ruby

    • To give JRuby and Graal Ruby a better chance, the benchmarks were modified so that Ruby MRI v2.0 runs each of them for about 20s-70s
  • Each benchmark was run 3 times and the minimal time (or minimal peak memory consumption) was chosen

    • In the tables for times I use the ratio <MRI v2.0 time> / <time>, which shows how much faster the particular implementation is than MRI v2.0
    • For memory I use the ratio <peak memory> / <MRI v2.0 peak memory>, which shows how much more memory the particular implementation uses than MRI v2.0
  • I also used optcarrot for a more serious program performance comparison

    • I ran optcarrot with the default number of frames (180) and with 2000 frames

Microbenchmark results

  • Wall time speedup ('wall MRI v2.0 time' / 'wall time'):
    • MJIT gives a real speedup comparable with other Ruby JITs
    • OMR is currently the worst-performing JIT
    • In most cases, GCC is a better choice for MJIT than LLVM

benchmark v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
while 1.0 1.11 1.82 387.29 9.28 1.06 2.3 2.89 2.35
nest-while 1.0 1.11 1.71 4.97 3.97 1.05 1.38 2.58 1.66
nest-ntimes 1.0 1.02 1.13 2.19 2.43 1.01 0.94 0.97 2.19
ivread 1.0 1.13 1.31 13.67 9.48 1.13 2.42 2.99 2.33
ivwrite 1.0 1.18 1.78 15.01 7.59 1.13 2.52 2.93 1.97
aread 1.0 1.03 1.44 19.69 7.03 0.98 1.79 3.53 2.17
awrite 1.0 1.09 1.42 13.09 7.45 0.96 2.18 3.74 2.55
aref 1.0 1.13 1.67 25.73 10.17 1.09 1.87 3.69 3.71
aset 1.0 1.51 2.68 23.45 17.82 1.47 3.61 4.49 6.33
const 1.0 1.09 1.53 27.53 10.15 1.05 2.98 3.89 3.01
const2 1.0 1.12 1.31 26.13 10.06 1.09 3.05 3.81 2.41
call 1.0 1.14 1.54 5.53 4.75 0.9 2.18 4.99 2.86
fib 1.0 1.21 1.43 4.16 3.81 1.1 2.17 5.03 2.26
fannk 1.0 1.05 1.1 1.1 1.1 0.99 1.71 2.32 1.02
sieve 1.0 1.3 1.72 3.34 3.36 1.27 1.49 2.42 2.02
mandelbrot 1.0 0.94 1.11 2.08 2.11 1.08 0.96 1.56 2.45
meteor 1.0 1.24 1.27 1.71 1.71 1.16 0.9 0.92 0.54
nbody 1.0 1.05 1.14 2.73 3.07 1.26 0.97 2.31 2.14
norm 1.0 1.13 1.09 2.52 2.49 1.15 0.91 1.45 1.62
trees 1.0 1.14 1.23 2.3 2.21 1.2 1.41 1.53 0.78
pent 1.0 1.13 1.24 1.71 1.7 1.13 0.6 0.8 0.33
red-black 1.0 1.01 0.94 1.3 1.14 0.88 0.98 2.52 1.03
bench 1.0 1.16 1.18 1.54 1.57 1.15 1.28 2.75 1.81
GeoMean. 1.0 1.12 1.39 6.18 4.02 1.09 1.59 2.48 1.83

  • CPU time improvement ('CPU MRI v2.0 time' / 'CPU time'):
    • CPU time matters too: in the cloud it is money, on mobile it is battery
    • MJIT almost always uses less CPU than the current MRI interpreter
    • Graal is too aggressive with compilations and almost always needs more CPU work than the MRI interpreter

benchmark v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
while 1.0 1.11 1.82 208.38 8.94 1.06 1.96 2.39 0.97
nest-while 1.0 1.11 1.71 4.81 3.84 1.05 1.16 1.94 0.67
nest-ntimes 1.0 1.02 1.13 2.13 2.34 1.01 0.86 0.87 0.76
ivread 1.0 1.13 1.31 13.22 9.15 1.13 2.1 2.45 0.97
ivwrite 1.0 1.18 1.78 14.16 7.28 1.13 1.9 2.19 0.72
aread 1.0 1.03 1.44 18.67 6.78 0.98 1.51 2.61 0.84
awrite 1.0 1.09 1.42 12.62 7.21 0.96 1.83 2.75 0.97
aref 1.0 1.13 1.67 24.83 9.91 1.09 1.69 3.14 1.55
aset 1.0 1.51 2.68 22.94 17.36 1.47 3.27 3.97 2.68
const 1.0 1.09 1.53 26.86 10.0 1.05 2.71 3.48 1.68
const2 1.0 1.12 1.31 25.52 9.89 1.09 2.77 3.41 1.52
call 1.0 1.14 1.54 5.42 4.62 0.9 1.83 3.54 1.02
fib 1.0 1.21 1.43 4.11 3.74 1.1 1.85 3.59 0.87
fannk 1.0 1.05 1.1 1.1 1.1 0.99 1.43 1.79 0.47
sieve 1.0 1.3 1.72 3.33 3.32 1.27 1.26 1.86 0.77
mandelbrot 1.0 0.94 1.11 1.99 2.01 1.08 0.86 1.31 0.77
meteor 1.0 1.24 1.27 1.37 1.3 1.16 0.76 0.73 0.16
nbody 1.0 1.05 1.14 2.49 2.74 1.26 0.82 1.68 0.75
norm 1.0 1.13 1.09 2.31 2.24 1.15 0.8 1.19 0.57
trees 1.0 1.14 1.23 2.15 2.04 1.2 1.02 1.04 0.25
pent 1.0 1.13 1.24 1.42 1.36 1.13 0.45 0.54 0.09
red-black 1.0 1.01 0.94 0.75 0.64 0.88 0.66 1.16 0.3
bench 1.0 1.16 1.18 1.21 1.24 1.15 0.96 1.72 0.51
GeoMean. 1.0 1.12 1.39 5.55 3.67 1.09 1.33 1.88 0.69

  • Peak memory increase ('max resident memory' / 'max resident MRI v2.0 memory'):
    • The memory consumed by MJIT's GCC or LLVM processes (data and code) is included
    • JITs require much more memory than the interpreter
    • OMR is the best among the JITs
    • JRuby and Graal are the worst in memory consumption

benchmark v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
while 1.0 0.98 1.06 4.34 8.31 2.49 445.13 460.07 117.57
nest-while 1.0 0.98 1.05 4.79 8.45 2.48 242.94 395.77 93.86
nest-ntimes 1.0 0.99 1.06 4.59 8.58 3.29 187.39 199.13 118.92
ivread 1.0 1.0 1.06 4.4 8.4 2.52 459.46 459.72 118.03
ivwrite 1.0 1.0 1.06 4.44 8.51 2.5 380.97 461.57 120.73
aread 1.0 0.99 1.08 4.49 8.55 2.53 287.09 466.46 94.07
awrite 1.0 0.99 1.06 4.48 8.47 2.5 327.78 385.46 94.91
aref 1.0 1.0 1.08 4.47 8.51 2.53 330.23 463.06 119.15
aset 1.0 0.99 1.07 4.55 8.55 2.55 331.78 467.5 119.14
const 1.0 0.98 1.06 4.39 8.47 2.5 464.19 463.56 122.41
const2 1.0 0.99 1.06 4.38 8.44 2.48 459.53 459.84 110.59
call 1.0 0.98 1.05 4.59 8.37 3.26 217.1 453.82 96.58
fib 1.0 1.01 1.05 4.78 8.45 3.19 37.33 37.96 144.21
fannk 1.0 1.03 1.09 3.96 7.93 2.55 212.3 237.77 154.02
sieve 1.0 1.03 1.03 1.03 1.03 1.05 16.06 16.05 5.12
mandelbrot 1.0 1.01 1.08 7.28 9.46 3.4 324.38 426.65 101.42
meteor 1.0 1.0 1.05 5.57 8.12 2.98 196.95 222.86 163.56
nbody 1.0 0.99 1.05 7.46 9.22 3.42 241.77 345.75 96.14
norm 1.0 1.04 1.09 5.27 8.42 3.21 303.29 400.41 145.18
trees 1.0 0.63 0.66 0.84 0.66 1.35 12.66 12.84 8.29
pent 1.0 0.96 1.02 6.16 8.86 3.14 114.18 159.37 146.91
red-black 1.0 0.99 1.0 1.0 1.0 1.31 2.24 2.52 3.73
bench 1.0 0.99 1.07 14.78 12.45 3.55 194.38 257.93 144.25
GeoMean. 1.0 0.98 1.04 4.15 6.44 2.54 161.76 198.86 79.65

Optcarrot results

  • Graal crashes with a convert_type exception
  • MJIT with LLVM has the best results
  • Although JRuby produces decent FPS, it requires too much CPU and memory
  • The optcarrot results differ from the microbenchmark ones:
    • MJIT with LLVM produces better wall-time results than MJIT with GCC

2000 frames

  • Frames Per Second (Speedup = FPS / 'MRI v2.0 FPS'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
FPS 28.96 39.43 35.76 78.22 90.58 35.37 33.52 68.73 -
Speedup 1.0 1.36 1.23 2.7 3.13 1.22 1.16 2.37 -

  • CPU time ('CPU MRI v2.0 time' / 'CPU time'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
Speedup 1.0 1.32 1.2 1.49 1.51 1.15 0.8 0.76 -

  • Peak Memory ('max resident memory' / 'max resident MRI v2.0 memory'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
Peak Mem 1.0 1.0 1.1 1.16 1.16 1.41 10.05 16.56 -

Default number of frames (180 frames)

  • Frames Per Second (Speedup = FPS / 'MRI v2.0 FPS'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
FPS 29.02 39.71 35.75 82.06 85.43 34.94 33.09 69.11 -
Speedup 1.0 1.37 1.23 2.83 2.94 1.2 1.14 2.38 -

  • CPU time ('CPU MRI v2.0 time' / 'CPU time'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
Speedup 1.0 1.33 1.2 1.53 1.45 1.13 0.79 0.76 -

  • Peak Memory ('max resident memory' / 'max resident MRI v2.0 memory'):

v2 base rtl mjit mjit-cl omr jruby9k jruby9k-d graal-22
Peak Mem 1.0 1.0 1.1 1.16 1.16 1.41 10.67 17.68 -