# High-performance computing on your laptop I: inference, compilation, and performance measurement

![julia logo](figures/julia_logo.png)

Timothy E. Holy

Washington University in St. Louis

# Performance!

![fast car](figures/fast_car.jpg)

## ...but it's also possible to get this:

![slow truck](figures/slow_truck.jpg)

Today's goal: learn how to make Julia "not slow" when needed

# Julia has good tools for helping you discover where things went wrong

![tools](figures/tools.png)

## ...but your most important tool is this one:

![brain](figures/brain.jpg)

# A tutorial on Julia's inner workings

- type inference
- methods and specialization
- runtime vs compiletime dispatch
- benchmarking
- profiling

Much of this continues from material that we hinted at in the introduction; now that you've had some time to learn Julia, let's dive a little deeper.

This is a bit complex, but mastering it provides the foundation you need to quickly become an expert Julia developer!

# A trivial (but revealing) example, in-depth

In [None]:
add2(x) = x[1] + x[2]

In [None]:
add2( [1.0, 2.0] )

In [None]:
add2( [1, 2] )

In [None]:
methods(add2)

In [None]:
m = @which add2([1, 2])
m

In [None]:
using MethodAnalysis
methodinstances(m)

We get two *MethodInstances* from a single *Method*: **compiler specialization**.

In [None]:
# Let's make another!
add2( (1, 2.0) )

In [None]:
methodinstances(m)

Julia creates these the first time you call `add2` with a new type.

On later calls, Julia just uses the code it has already compiled.
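A rough way to see this for yourself is to time the first and second call with an argument type that has not been used before. This is only a sketch: `add2_demo` is a hypothetical fresh copy of `add2` (so that no specializations exist yet), and `@elapsed` is a coarse timer, so the exact numbers will vary from machine to machine.

```julia
# A fresh copy of `add2`, so that no specializations exist yet (hypothetical name)
add2_demo(x) = x[1] + x[2]

v = Float32[1, 2]                    # an argument type we have not used so far

t_first  = @elapsed add2_demo(v)     # includes type inference and compilation
t_second = @elapsed add2_demo(v)     # reuses the code compiled by the first call

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

On most machines the first call should be noticeably slower than the second, because only the first one pays for compilation.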

## What are the differences between these MethodInstances?

Julia lets you see how this works:

In [None]:
@code_lowered add2([1, 2])  # represents the Method, not a particular MethodInstance

`x[1]` is implemented as a call `getindex(x, 1)`. The `getindex` function is defined in `Base`.

`%1`, `%2`, `%3` are like temporary variables. (Compiler lingo: [static single assignment (SSA) values](https://en.wikipedia.org/wiki/Static_single_assignment_form).)


Some of the other markers delimit *basic blocks*: stretches of code that execute without any branches (no `if`, `while`, etc. within the block).

In [None]:
@code_typed optimize=false add2([1, 2])    # represents the specific MethodInstance

In [None]:
@code_typed optimize=false add2([1.0, 2.0])

In [None]:
@code_typed optimize=false add2((1, 2.0))

## How does Julia know this?

In [None]:
typeof([1, 2])

In [None]:
@code_typed optimize=false getindex([1, 2], 1)

In [None]:
@code_typed optimize=false 1+2

In [None]:
@code_typed optimize=false Base.add_int(1, 2)

In [None]:
@code_typed optimize=false 1 + 2.0

In [None]:
@code_typed optimize=true 1 + 2.0

You can look even deeper with `@code_llvm` (shows the final result of Julia's compiler before handing the code off to [LLVM](https://llvm.org/)) and `@code_native` (the final CPU instructions optimized for your specific machine).
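As a minimal sketch, both macros can be applied to `add2` in the same way as `@code_typed`. The `debuginfo=:none` option is used here only to suppress line-number annotations and keep the output short; the output itself depends on your Julia version and CPU, so none is shown.

```julia
using InteractiveUtils   # provides the @code_* macros outside the REPL and notebooks

add2(x) = x[1] + x[2]

# LLVM intermediate representation for the Tuple{Int,Float64} specialization
@code_llvm debuginfo=:none add2((1, 2.0))

# Native assembly for the same specialization
@code_native debuginfo=:none add2((1, 2.0))
```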

# A summary of how Julia builds code

Strings in your files get *parsed* into Julia `Expr`essions.

`Expr`essions get *lowered* (like `@code_lowered`).

**Tip**: This is mostly what gets saved to a file when you see `[ Info: Precompiling...`

When you call `f(args...)`, either:
- the native code gets looked up in the in-memory storage and then run
- OR:
  + `f` gets type-inferred for the argument types
  + the type-inferred code gets optimized by Julia (this gets stored in memory)
  + the result gets compiled by LLVM (this also gets stored in memory)
  + Julia runs the compiled code

# Dispatch: runtime vs compile-time

Remember that `add2` is defined as

```julia
add2(x) = x[1] + x[2]
```

However:

In [None]:
+

Which of these 190 methods gets called?

In [None]:
@code_typed optimize=false add2( (1, 2.0) )

In [None]:
@code_typed optimize=true add2((1, 2.0))   # optimize=true performs inlining: https://en.wikipedia.org/wiki/Inline_expansion

Julia didn't have to ask that question when the function was running: the types could be inferred and so Julia could determine which method was applicable when the code was being compiled.

"Compile-time dispatch"

In [None]:
@code_typed optimize=true add2( Any[1, 2.0] )   # call it on a `Vector{Any}`

Julia doesn't get very far in optimizing this code: the element type is `Any`, so the compiler cannot tell which `+` method will be needed.
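A quick check that avoids reading the typed code is to ask inference for the return type directly. This is a sketch that uses `Base.return_types`, an unexported reflection helper; `@code_warntype` presents the same information interactively.

```julia
add2(x) = x[1] + x[2]

# Inferred return type when the argument is a concretely-typed tuple
rt_tuple = only(Base.return_types(add2, (Tuple{Int,Float64},)))

# Inferred return type when the argument is a Vector{Any}
rt_any = only(Base.return_types(add2, (Vector{Any},)))

println(rt_tuple, "  concrete? ", isconcretetype(rt_tuple))
println(rt_any,   "  concrete? ", isconcretetype(rt_any))
```

A non-concrete inferred type such as `Any` is a strong hint that the method to call could not be chosen while compiling.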

"Run-time dispatch": types have to be checked while the code is running, and then the appropriate method chosen, possibly compiled, and executed.

# Runtime vs compiletime dispatch

A compiled function is a "blob" of native code living in a particular memory location.

Calling a function involves:
- preparing the arguments
- deciding *which* specific compiled blob to use. This is like looking up someone's phone number in the phone book. Julia literally scans through the method tables.

This decision can be made during *runtime* (when code is executing) or during *compiletime* (when Julia is compiling the function).

Schematic of a compiletime call in pseudo-Julia:
```julia
push!(execution_stack, args)
@goto compiled_blob_52383
```
(The blob will retrieve the argument values by [popping the execution stack](https://en.wikipedia.org/wiki/Call_stack).)

Schematic of a runtime call in pseudo-Julia:
```julia
# scan the method tables and their lists of compiled blobs for a match
# if the right blob hasn't been compiled yet, compile it now
blob = get_blob_for_argtypes(f, typesof(args))
# The rest looks the same as a compiletime call:
push!(execution_stack, args)
goto(blob)
```

# Comparing the performance of runtime vs compile-time dispatch

We'll use the [BenchmarkTools](https://github.com/JuliaCI/BenchmarkTools.jl) package.

In [None]:
using BenchmarkTools
@btime add2( (1, 2.0) ) # setup=(x = rand(1:5); y = rand())

In [None]:
@btime add2(z) setup=(z = rand(2))

In [None]:
@btime add2(z) setup=(x = rand(1:5); y = rand(); z = Any[x, y])

Ballpark costs of runtime dispatch:
- single argument: 15-35ns
- two arguments: ~100ns
- ...


Runtime dispatch is *slow*. It's also the most common reason to get performance that disappoints.
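The cost accumulates quickly inside a loop. The following sketch uses only `@elapsed`, which is cruder than BenchmarkTools but needs no extra packages; `sum_pairs`, `concrete`, and `boxed` are hypothetical names introduced for illustration.

```julia
add2(x) = x[1] + x[2]

# Calls `add2` once per element of `pairs`
function sum_pairs(pairs)
    s = 0.0
    for p in pairs
        s += add2(p)
    end
    return s
end

concrete = [rand(2) for _ in 1:10^5]              # Vector{Vector{Float64}}
boxed    = [Any[p[1], p[2]] for p in concrete]    # Vector{Vector{Any}}, same values

sum_pairs(concrete); sum_pairs(boxed)             # run once to compile

t_concrete = @elapsed sum_pairs(concrete)
t_boxed    = @elapsed sum_pairs(boxed)
println("concrete elements: ", t_concrete, " s")
println("Any elements:      ", t_boxed, " s")
```

Both calls compute the same sum; only the element type of the inner vectors differs.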

## Union-splitting

An intermediate case is [Union-splitting](https://julialang.org/blog/2018/08/union-splitting/), where Julia can determine that there are only a few possible argument types:

In [None]:
@btime add2(z) setup=(x = rand(Float32); y = rand(); z = Union{Float32,Float64}[x, y])

```julia
argtypes = typesof(args)
push!(execution_stack, args)
if argtypes === Tuple{Float32}
    @goto compiled_blob_52383
else # the only other option is Tuple{Float64}
    @goto compiled_blob_52951
end
```
Note that there is no need to call `get_blob_for_argtypes`. Union-splitting generalizes compiletime dispatch.
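The schematic above can be imitated in ordinary Julia by writing the branch yourself. This is only an analogy for what the compiler does automatically; `half` and `split_by_hand` are hypothetical names.

```julia
# Two methods, one per concrete type (hypothetical example function)
half(x::Float32) = x / 2f0
half(x::Float64) = x / 2.0

# A hand-written version of union-splitting: one type check, then a call
# whose target method is known when the code is compiled
function split_by_hand(x::Union{Float32,Float64})
    if x isa Float32
        return half(x)    # inside this branch, x is known to be a Float32
    else
        return half(x)    # the only other option is Float64
    end
end

println(split_by_hand(1f0))
println(split_by_hand(1.0))
```

The two branches look identical in the source, but each one resolves to a different method of `half` because the type of `x` is known within the branch.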

# Profiling

*Profiling* allows you to measure where your code spends its time.

An *instrumenting profiler* adds measurement "instrumentation" to your source code: if you wrote

```julia
y = f(x)                  # this is line 17
z = g(x, y)               # this is line 18
```

an instrumenting profiler might effectively turn this into

```julia
push!(timebuffer, ProfileInfo("somefile.jl", 17, time()))
y = f(x)                  # this is line 17
push!(timebuffer, ProfileInfo("somefile.jl", 18, time()))
z = g(x, y)               # this is line 18
push!(timebuffer, ProfileInfo("somefile.jl", 19, time()))
...
```

Problems with instrumenting profilers:
- instrumentation slows your code
- instrumentation can block compiler optimizations: the compiled code of the real version may be quite different from that of the instrumented code "minus instrumentation"
- recursion is a bit tricky: if you also instrument `f`, the added instrumentation inside `f` distorts your measurement of the runtime of `f` itself.
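To make the idea concrete, here is a runnable sketch of hand-written instrumentation. `f`, `g`, `ProfileInfo`, and `instrumented` are hypothetical stand-ins, and `time_ns()` is used in place of `time()` to get nanosecond resolution.

```julia
f(x) = x + 1            # hypothetical stand-ins for real work
g(x, y) = x * y

# One timestamp record per instrumented line
struct ProfileInfo
    file::String
    line::Int
    t::UInt64           # value returned by time_ns()
end

function instrumented(x)
    timebuffer = ProfileInfo[]
    push!(timebuffer, ProfileInfo("somefile.jl", 17, time_ns()))
    y = f(x)
    push!(timebuffer, ProfileInfo("somefile.jl", 18, time_ns()))
    z = g(x, y)
    push!(timebuffer, ProfileInfo("somefile.jl", 19, time_ns()))
    return z, timebuffer
end

z, timebuffer = instrumented(3)

# The time attributed to a line is the gap until the next timestamp
for i in 1:length(timebuffer)-1
    dt = timebuffer[i+1].t - timebuffer[i].t
    println("line ", timebuffer[i].line, ": ", Int(dt), " ns")
end
```

For functions as cheap as these, the `push!` calls probably cost more than the work being measured, which illustrates the first problem in the list.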

## Sampling profilers

A sampling profiler periodically interrupts your code and collects program-location data.

Analogy: a person spends
- 8 hours a day at work
- 1 hour a day at the gym

At random times, a friend texts and says "where r u?" Do this hundreds of times over a month.

In [None]:
using PyPlot
location = [rand(1:9) <= 8 ? "work" : "gym" for i = 1:1000]
hist(location)
ylabel("Counts");

Approximately 8x more of the locations were "work", so you infer the person spends ≈8x more time at work.

Disadvantages of sampling profilers:

- you don't collect exhaustive information: spending 5 minutes at the bank will only rarely be captured at all
- it's subject to sampling noise (given `n` counts, you have an uncertainty of approx. `√n` counts)

Advantages of sampling profilers:

- you're running *unmodified* code

# Demo of profiling

In [None]:
# could also use `sleep` but it gives more complicated profiling results
function busywait(t)
    x = 0
    for i = 1:round(Int, 2.1e10*t)
        x += i % 2
    end
    return x    # return `x` to prevent the compiler from noticing that this doesn't do real work & eliminating it
end

In [None]:
@time busywait(0.8)

In [None]:
function mydays(n)
    x = 0
    for i = 1:n
        x += work()
        x += gym()
    end
    return x
end
@noinline work() = busywait(0.08)
@noinline gym() = busywait(0.01)

In [None]:
mydays(1)    # run once to compile it

In [None]:
using Profile
Profile.clear()     # clear old results (not really needed on the first usage)
@profile mydays(30)

In [None]:
Profile.print(format=:flat)

In [None]:
Profile.print(format=:tree)
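For longer outputs, `Profile.print` accepts keyword arguments that filter and sort the report. The sketch below is self-contained and uses a hypothetical `spin` function in place of `mydays`; the iteration count is an assumption chosen to give a short but measurable run on a typical laptop.

```julia
using Profile

# A simple busy loop standing in for real work (hypothetical name)
function spin(n)
    x = 0
    for i = 1:n
        x += i % 2
    end
    return x
end

spin(10)                    # run once to compile it
Profile.clear()
@profile spin(2 * 10^9)

# Flat report, sorted by sample count, hiding lines with fewer than 5 samples
Profile.print(format=:flat, sortedby=:count, mincount=5)
```

If the loop finishes too quickly on your machine, no samples may be collected; increasing the iteration count should help.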

## Visualization of profile data

In [None]:
using ProfileSVG       # for "real" work, ProfileView is recommended instead
ProfileSVG.view()

"FlameGraph":
- height encodes the call depth (bars get called by the bar below them)
- width is proportional to run time

# Performance profiling in action

In [None]:
function mult(A, B, x)
    C = A * B
    return C * x
end

A = rand(10000, 2)
B = rand(2, 8000)
x = rand(8000)

mult(A, B, x);

In [None]:
@time mult(A, B, x);

In [None]:
@profview mult(A, B, x)  

In [None]:
function mult2(A, B, x)
#     C = A * B
#     return C * x
    y = B * x
    return A * y
end

mult2(A, B, x) ≈ mult(A, B, x)

In [None]:
@time mult2(A, B, x);

We'll cover *why* this is better in the next session.

# Using profiling to detect "gotchas"

Recall that `add2` was slow when passed a `Vector{Any}`, but fast for a `Vector{T}` with concrete `T`.

In [None]:
@btime add2(z) setup=(x = rand(1:5); y = rand(1:5); z = [x, y])

In [None]:
@btime add2(z) setup=(x = rand(1:5); y = rand(1:5); z = Any[x, y])

In [None]:
@bprofile add2(z) setup=(x = rand(1:5); y = rand(1:5); z = Any[x, y])

In [None]:
ProfileSVG.view()

- red: runtime dispatch
- yellow/orange: memory cleanup (correlates with memory allocation)

(Demo using ProfileView to detect & diagnose type problems)

# Summary

Julia *can* be fast. But you need to learn enough to avoid some common gotchas:

- don't use non-`const` global variables (see homework)
- don't use containers like arrays with non-concrete types unless absolutely necessary
- measure & analyze performance with `@time`, BenchmarkTools, and `@profile`/ProfileView