Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
383 lines (289 sloc) 12.9 KB

Optimizing RaptorJIT programs

Introduction

This is a book about how to make Lua code competitive with C, C++, and Rust.

Our trick will be to use a tracing just-in-time compiler to transform high-level Lua code into efficient low-level machine code. Our Lua code will freely use high-level abstractions even inside inner loops – tables, metatables, dynamically typed variables, first-class functions, and so on – and the compiler will see right through these abstractions at runtime and make them disappear. The final result will be tight machine code that is well suited to modern CPUs and competitive with statically typed code from a traditional C compiler.

Sounds too good to be true, right?

There is a catch: to write efficient code for a tracing JIT we will have to think different. If we write code with a traditional compiler in mind then it will not automatically take advantage of the JIT. That would be like writing SQL code to use with a NoSQL database, or local network code to use over a WAN, or SSD-optimized code to use on a hard disk. It will run, sure, but the performance will be surprising and often disappointing in practice.

Instead we will have to learn how the tracing JIT works so that we can take advantage of its strengths while avoiding its weaknesses. This is not necessarily difficult – the tracing JIT is actually simpler than a traditional compiler – but it does require new skills.

Specifically we will need to be able to:

  1. Write code in a style that the JIT can compile efficiently.
  2. Profile programs to identify which parts need to be improved.
  3. Optimize programs until we are satisfied with their profiles.

The purpose of this book is to help you learn these three skills.

RaptorJIT: Our Lua VM

Family tree: Lua, LuaJIT, RaptorJIT

Lua is a tasteful dynamic language.

LuaJIT is a faster alternative implementation for a dialect of Lua.

RaptorJIT is a hard-fork of LuaJIT.

Objections

Is tracing JIT a dead-end?

Is LuaJIT dead?

Is “competitive with C” just hyperbole?

To keep LuaJIT alive and make it easier to use for system programming.

Why this book?

Because application programmers need to know how to optimize programs for the JIT and not comprehensive documentation exists.

Tracing JIT

Tracing just-in-time compilation is a high-risk high-reward approach to optimizing programs written in a dynamic language.

Source code written in a dynamic language is hard to compile because many details have been deliberately left open. The only way to know what a program really does is to run it. This is exactly what the tracing JIT compiler does.

Specifically, the tracing JIT prepares to compile a piece of code by first running it once in an interpreter and recording all of the details that became apparent at runtime: the type of each local variable, the definition of each called function and method, and even which conditional branches were taken and which were not. The result is a highly detailed log of how that piece of code runs – or at least how it ran the one time that it was observed.

The tracing JIT then makes a gambit: it optimizes the code based on the prediction that the code will keep on running the same way in the future. This is a speculative optimization based on the intuition that a piece of code will tend to execute in the same way each time it is called in the same context. If the predictions usually come true at runtime then the code will run fast – competitive with C.

The key to optimizing for the tracing JIT is to understand how it works and then play to its strengths while avoiding its weaknesses.

What is a trace?

  • Linear sequence of instructions.
  • Guards.
  • Statically typed values.
  • Lua function calls inlined.

How the JIT creates traces for a program

(trace, guard, exit)

Categories of trace:

Root trace
Trace that starts on a bytecode instruction.
Loop trace
Trace that exactly covers one innermost loop. Typically a root trace, but can also be a side trace that diverges at the beginning of a root trace.
Function trace
Root trace that exactly covers one function and then returns to the interpreter.
Side trace
Trace that starts from a specific point of divergence (exit) from another trace.

Loop traces

Loop traces are the most important kind of trace. One loop trace is compiled for each of the innermost loops in the program source code, and these loops are compiled much more thoroughly than the rest of the code – typically twice as fast or faster – thanks to loop optimization.

Loop optimization compiles code to execute a series of iterations in the same loop rather than just one. This is very powerful because it allows compiler optimizations to span across multiple loop iterations instead of having to optimize each iteration separately without reference to the others.

Specifically,

  • Later iterations can reuse values that were loaded or calculated in earlier iterations;
  • Guards tested in the first iteration do not have to be retested in the following iterations;
  • Stores to Lua objects can be cached and updated between iterations of the loop, and then committed at the end.

The downside to loop optimization is that it is fragile: it only works when successive iterations of the loop follow exactly the same code path and “stay on trace.” Each time a loop iteration strays from the execution path recorded for the loop trace – exits onto a side trace – the following iteration will need to re-enter the loop by flushing cached stores, reloading referenced Lua objects, rechecking guard conditions, and so on. This means that the benefit of loop optimization is easily lost unless the innermost loops are written carefully.

Examples

Let us look at some example programs and think about which loop traces they will have and how those loops will be compiled.

First, here are two functions to calculate the sum and product of an array of numbers.

-- Return the sum of all numbers in array.
function sum(array)
  local acc = 0
  for _, x in ipairs(array)   (ref:sum-loop)
    acc = acc + x
  end
  return acc
end

-- Return the product of all numbers in array.
function product(array)
  local acc = 0
  for _, x in ipairs(array)   (ref:product-loop)
    acc = acc * x
  end
  return acc
end

There are two innermost loops in the source code: the for loop that computes a sum on line (sum-loop) and the for loop that computes a product on line (product-loop). The JIT will compile each of these innermost loops into a separate looping trace, and these traces will be efficient because each one always does the same thing.

Then, here is a different implementation of those same functions:

-- Return the sum of all numbers in array.
function sum(array)
  fold(array, 0, function(x,y) x+y end)
end

-- Return the product of all numbers in array.
function product(array)
  fold(array, 0, function(x,y) x*y end)
end

function fold(array, acc, fn)
  for _, x = ipairs(array)     (ref:fold-loop)
    acc = fn(acc, x)
  end
  return acc
end

The most obvious difference is that this version passes around higher-order functions and invokes a function object for each loop iteration. This is actually only a small difference from the compiler’s perspective though. The JIT always inlines function calls, even when dealing with higher-order functions, and so the apparent indirection in the source code is all optimized away during compilation.

The big difference is that now we only have one loop in the source code, the for loop on line (fold-loop), and this loop will sometimes do addition for sum but other times do multiplication for product. The compile can loop-optimize for one or the other of these cases, but not for both. In practice this means that only one use of our naive fold() function will be compiled efficiently as a loop trace and all other uses will be compiled inefficiently as side-traces exiting from that loop trace.

Side traces

A side trace represents a code path that diverges from a previously compiled parent trace at some specific point. The parent trace was specialized on certain specific conditions – the types of variables, the outcomes of conditional branches, the definitions of function objects – and when one of these conditions persistently fails to hold at runtime then a side trace is created to optimize an alternative execution path.

Side traces are specialized on specific conditions too. Each side trace represents just one alternative path. If many different paths are persistently taken at runtime then new side traces are created to handle each one. The result is a trace tree consisting of one root trace and a collection of side traces, side-side traces, side-side-side traces, etc (referred to simply as “side traces.”)

Combinatorial explosion is limited in practice both because each side trace automatically finishes when it reaches the start of a root trace for it to connect with and because the JIT limits the number of side traces it will create before falling back to the interpreter to handle further cases.

Example

Here is an example of a loop that changes its behaviour over time in ways that require side traces.

function foo(x) end
function bar(x) end
local hook = foo

local obj = 'a string'
for i = 1, 1000000 do
  if i >= 1000 then                     (ref:switched-bias)
    hook(obj)                           (ref:switched-call)
  end
  if i == 10000 then
    hook = bar                          (ref:switched-hook)
  end
  if i == 100000 then
    obj = {1,2,3,4,5}                   (ref:switched-type)
  end
end

The tracing JIT will start by compiling this into a loop trace specialized for the initial conditions where all of the if conditions evaluate to false and therefore none of the then clauses run.

Later, side traces will be compiled to handle divergences:

  1. From iteration 1000 onwards the first if condition on line (switched-bias) will evaluate true and the then clause will invoke the hook function. This control flow divergence will lead to a new side trace.
  2. From iteration 10000 the definition of the hook function will change. This will cause line (switched-call) to call a different function object and that will lead to a side trace.
  3. From iteration 100000 the type of obj will change from a string to an array (table.) This will change the type of the value passed to the hook function on line (switched-call) and therefore the type of the parameter variable x in the hook function. (Note: Because the compiler inlines all function calls we always need to take into account the code inside each function that is called in a trace.)

Note that no side traces are required for running the then clauses of the second and third if statements. These execution paths are not persistent – each occurs only once – and so they are handled by the interpreter instead of being JIT compiled.

Function traces

Function traces are a fallback.

Exceptional cases

Loop unrolling

Instability unrolling

Profiling

System profiling

Lua VM vs. libraries vs. kernel

CPU efficiency

Lua VM profiling

The ideal program

Interpreter time

Garbage collector time

Line vs. Loop time

Optimization

Profile interpretation patterns

Ideal profile

All time spent in JIT loops.

Healthy profile

Time is spent in JIT loops or else deliberate FFI/GC.

Disrupted compilation

Time spent in ->interp and/or ->return traces.

Mismatched branch bias

Side-traces taking more time than their parents.

Low loop factor

Low % of time is spent in looping machine code compared with line code.

Specific hazard anti-patterns

Closure creation (FNEW NYI)

Context

Trying to reduce Disrupted compilation.

Time is attributed to a ->interp trace that aborted due to NYI: FNEW.

Problem

Function closure is being created in performance sensitive code. This cannot be JITed.

Solution

Reformulate code to avoid creating a closure in this code.

Related

C-API call

Too many local variables

Disruptive branch

Poorly biased branch in a library routine disrupting the compilation of its caller.

Disruptive loop

Loop in a library routine preventing its caller from being an innermost loop.

Code optimization patterns

Biased branch

Fully biased branch

Hoisted test

Split loop

Sunk pointer [*]

Eliminated branch

Data optimization patterns

Freelist

FFI object

Reused C-type

You can’t perform that action at this time.