# Understandable performance
*Going fast, nowhere*

In [13]:
using Pkg
Pkg.activate(;temp=true)
Pkg.add(["BenchmarkTools", "LLVM", "Unitful"])

  Activating new project at `/tmp/jl_skn2MW`
   Resolving package versions...
   Installed Unitful ─ v1.10.1
    Updating `/tmp/jl_skn2MW/Project.toml`
  [6e4b80f9] + BenchmarkTools v1.2.2
  [929cbde3] + LLVM v4.7.1
  [1986cc42] + Unitful v1.10.1
    Updating `/tmp/jl_skn2MW/Manifest.toml`
  [6e4b80f9] + BenchmarkTools v1.2.2
  [fa961155] + CEnum v0.4.1
  [187b0558] + ConstructionBase v1.3.0
  [692b3bcd] + JLLWrappers v1.4.0
  [682c06a0] + JSON v0.21.2
  [929cbde3] + LLVM v4.7.1
  [69de0a69] + Parsers v2.2.0
  [21216c6a] + Preferences v1.2.3
  [1986cc42] + Unitful v1.10.1
  [dad2f222] + LLVMExtra_jll v0.0.13+1
  [0dad84c5]

## A note on benchmarking
*Premature optimization is the root of all evil* & *If you don't measure you won't improve*

### Tools
1. BenchmarkTools.jl https://github.com/JuliaCI/BenchmarkTools.jl
2. Profiler https://docs.julialang.org/en/latest/manual/profile/
3. ProfileView.jl https://github.com/timholy/ProfileView.jl
4. PProf https://github.com/vchuravy/PProf.jl


### Other
1. VTune/Perf/OProfile https://docs.julialang.org/en/latest/manual/profile/#External-Profiling-1
2. LIKWID 
3. LinuxPerf
4. MCAnalyzer

## BenchmarkTools.jl
A solid package that tries to eliminate common pitfalls in performance measurement.
- The `@benchmark` macro repeatedly evaluates your code until it has gathered enough samples
- Caveat: You probably want to interpolate your input data with `$`, so that it is not benchmarked as a global variable
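A minimal sketch of the interpolation caveat, assuming BenchmarkTools is installed. The names `xs` and `total` are illustrative and not part of this notebook. Without `$`, `xs` is looked up as a non-constant global inside the benchmark, so the timing can include overhead unrelated to the function itself; with `$`, the value is spliced into the benchmark expression.

```julia
using BenchmarkTools

xs = rand(2^10)          # non-constant global, like `data` in this notebook

# Hypothetical helper to have something to time
total(v) = foldl(+, v)

t_global = @belapsed total(xs)    # `xs` is resolved as an untyped global
t_interp = @belapsed total($xs)   # `$xs` splices the value into the benchmark

println("global: ", t_global, " s   interpolated: ", t_interp, " s")
```

`@belapsed` returns the minimum observed time in seconds; the exact numbers depend on the machine, so compare the two values on your own hardware rather than relying on a fixed ratio.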

In [5]:
data = rand(2^10);

In [6]:
function sum(X::Vector{Float64})
    acc = 0::Int64
    for x in X
        (acc += x)::Float64
    end
    acc::Union{Int64, Float64}
end

sum (generic function with 1 method)

In [7]:
using BenchmarkTools
@benchmark sum($data)

BenchmarkTools.Trial: 10000 samples with 123 evaluations.
 Range (min … max):  745.691 ns … 1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     746.504 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   748.123 ns ± 8.416 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

## Figuring out what is happening
The stages of the compiler
- `@code_lowered`
- `@code_typed` & `@code_warntype`
- `@code_llvm`
- `@code_native`

Where is a function defined
`@which` & `@edit`
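As a rough sketch of how these macros can be used from a script: outside the REPL and notebooks they come from the `InteractiveUtils` standard library. The function `double` is a made-up example, not part of this notebook.

```julia
using InteractiveUtils   # provides @which, @code_lowered, ... outside the REPL

# Illustrative function (hypothetical, not from the notebook)
double(x) = 2x

m = @which double(3)                 # the Method that `double(3)` dispatches to
println(m)

lowered = @code_lowered double(3)    # lowered IR, before type inference
println(lowered)

# Function form of @code_typed: one `CodeInfo => return type` pair per method
typed = code_typed(double, (Int,))
println(last(first(typed)))
```

The function forms (`code_lowered`, `code_typed`) take a tuple of argument types instead of a call expression, which is convenient when there is no concrete argument value at hand.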

In [8]:
@code_lowered sum(data)

CodeInfo(
1 ─       acc = Core.typeassert(0, Main.Int64)
│   %2  = X
│         @_3 = Base.iterate(%2)
│   %4  = @_3 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_3
│         x = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│   %10 = acc + x
│         acc = %10
│         Core.typeassert(%10, Main.Float64)
│         @_3 = Base.iterate(%2, %9)
│   %14 = @_3 === nothing
│   %15 = Base.not_int(%14)
└──       goto #4 if not %15
3 ─       goto #2
4 ┄ %18 = acc
│   %19 = Core.apply_type(Main.Union, Main.Int64, Main.Float64)
│   %20 = Core.typeassert(%18, %19)
└──       return %20
)

# A simple example: counting

In [11]:
function f(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

f (generic function with 1 method)

In [12]:
N = 100_000_000
result = @benchmark f($N)

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.249 ns … 26.100 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.260 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.261 ns ±  0.275 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [15]:
using Unitful

In [16]:
t = time(minimum(result)) * u"ns" # in ns
pFreq = N/t |> u"PHz"
t, pFreq

(1.249 ns, 80.06405124099278 PHz)

So we are apparently doing 100 million additions in about 1.2 ns.
That would mean our processor is operating at 80 PHz...

We wish...

What is going on?

Let's do a basic check, 10x bigger input

In [17]:
@benchmark f($(10*N))

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.249 ns … 5.190 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.260 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.260 ns ± 0.067 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

Let's explore what code we are **actually** running. 

Using Julia's reflection macros we can see all of the stages of code-generation.

In [18]:
@code_lowered f(N)

CodeInfo(
1 ─       acc = 0
│   %2  = 1:N
│         @_3 = Base.iterate(%2)
│   %4  = @_3 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_3
│         i = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│         acc = acc + 1
│         @_3 = Base.iterate(%2, %9)
│   %12 = @_3 === nothing
│   %13 = Base.not_int(%12)
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
)

In [19]:
@code_typed optimize=false f(N)

CodeInfo(
1 ─       (acc = 0)::Core.Const(0)
│   %2  = (1:N)::Core.PartialStruct(UnitRange{Int64}, Any[Core.Const(1), Int64])
│         (@_3 = Base.iterate(%2))::Union{Nothing, Tuple{Int64, Int64}}
│   %4  = (@_3 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))::Int64
│   %9  = Core.getfield(%7, 2)::Int64
│         (acc = acc + 1)::Int64
│         (@_3 = Base.iterate(%2, %9))::Union{Nothing, Tuple{Int64, Int64}}
│   %12 = (@_3 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
) => Int64

In [20]:
@code_typed optimize=true f(N)

CodeInfo(
1 ── %1  = Base.sle_int(1, N)::Bool
│    %2  = Base.ifelse(%1, N, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       Base.nothing::Nothing
└───       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = Base.not_int(%8)::Bool
└───       goto #10 if not %10
5 ┄─ %12 = φ (#4 => %9, #9 => %21)::Int64
│    %13 = φ (#4 => 0, #9 => %14)::Int64
│    %14 = Base.add_int(%13, 1)::Int64
│    %15 = (%12 === %2)::Bool
└───       goto #7 if not %15
6 ──       Base.nothing::Nothing
└───       goto #8
7 ── %19 = Base.add_int(%12, 1)::Int64
└───       goto #8
8 ┄─ 

In [21]:
@code_llvm optimize=false f(10)

;  @ In[11]:1 within `f`
define i64 @julia_f_3683(i64 signext %0) #0 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = bitcast {}*** %1 to {}**
  %current_task = getelementptr inbounds {}*, {}** %2, i64 2305843009213693940
  %3 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %3, i64 13
;  @ In[11]:3 within `f`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:354 wit

In [22]:
@code_llvm optimize=true f(10)

;  @ In[11]:1 within `f`
define i64 @julia_f_3706(i64 signext %0) #0 {
top:
;  @ In[11]:3 within `f`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:354 within `UnitRange`
; ││┌ @ range.jl:359 within `unitrange_last`
     %.inv = icmp sgt i64 %0, 0
; └└└
  %spec.select = select i1 %.inv, i64 %0, i64 0
;  @ In[11]:6 within `f`
  ret i64 %spec.select
}


In [23]:
@code_native f(10)

	.text
; ┌ @ In[11]:3 within `f`
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rax
; │ @ In[11]:6 within `f`
	retq
	nopl	(%rax)
; └


# Conclusion

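LLVM realised that our loop

```julia
for i in 1:N
  acc += 1
end
```

just ends up being $acc = 1 * N$ (or $0$ when $N < 1$, which is the `select` in the optimized IR above), and replaced it with that closed-form expression. No loop runs at all, which is why the timing does not depend on $N$.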
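A quick way to convince yourself of this closed form is to compare the loop against `max(N, 0)` for a few inputs. This is only a sketch: `count_up` is the same loop as `f`, renamed so that the block runs on its own.

```julia
# Same loop as `f` above, renamed so the sketch is self-contained
function count_up(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

# `1:N` is empty for N <= 0, so the loop body never runs and the result is 0
for N in (-5, 0, 1, 10, 1_000)
    @assert count_up(N) == max(N, 0)
end
println(count_up(10))   # 10
```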

# Exercise

What happens with:

```julia
function g(N)
    acc = 0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```

and
    
```julia
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```

Take some time to study the different stages of the compilation pipeline.
    

In [24]:
function g(N)
    acc = 0
    for i in 1:N
        acc += 1.0
    end
    acc
end

g (generic function with 1 method)

In [26]:
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end

h (generic function with 1 method)

In [28]:
function g(::Type{T}, N) where T<:Number
    acc = zero(T)
    for i in 1:N
        acc += one(T)
    end
    acc
end

g (generic function with 2 methods)
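One possible starting point for the exercise, shown here as a sketch rather than a full answer, is to compare what type inference concludes about `g` and `h`. `Base.return_types` is an unexported reflection helper in Base; the definitions are repeated so the block runs on its own.

```julia
function g(N)
    acc = 0          # starts as an Int ...
    for i in 1:N
        acc += 1.0   # ... and becomes a Float64 after the first iteration
    end
    acc
end

function h(N)
    acc = 0.0        # Float64 throughout
    for i in 1:N
        acc += 1.0
    end
    acc
end

# Inferred return type(s) for an Int argument
println(Base.return_types(g, (Int,)))
println(Base.return_types(h, (Int,)))

# The concrete result type of `g` depends on whether the loop ran at all
println(typeof(g(0)), " ", typeof(g(3)))
```

From there, `@code_warntype g(10)` and `@code_llvm g(10)` show how the changing type of `acc` propagates through the later stages.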

# Can we actually measure the speed of our original code?

In [10]:
##########################
# Low-level benchmarking #
##########################
using LLVM
using LLVM.Interop

"""
    clobber()

Force the compiler to flush pending writes to global memory.
Acts as an effective read/write barrier.
"""
@inline clobber() = @asmcall("", "~{memory}", true) 

"""
    escape(val)

The `escape` function can be used to prevent a value or
expression from being optimized away by the compiler. This function is
intended to add little to no overhead.
See: https://youtu.be/nXaxk27zwlk?t=2441
"""
@inline escape(val::T) where T = @asmcall("", "X,~{memory}", true, Nothing, Tuple{T}, val)

┌ Info: Precompiling LLVM [929cbde3-209d-540e-8aea-75f648917ca0]
└ @ Base loading.jl:1423


escape

In [44]:
function k(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        clobber()
    end
    return acc
end

k (generic function with 1 method)

In [52]:
@code_llvm debuginfo=:none k(Int64, 10)

define i64 @julia_k_4629(i64 signext %0) #0 {
top:
  %.inv = icmp sgt i64 %0, 0
  %1 = select i1 %.inv, i64 %0, i64 0
  br i1 %.inv, label %L12, label %L29

L12:                                              ; preds = %L12, %top
  %value_phi2 = phi i64 [ %2, %L12 ], [ 1, %top ]
  call void asm sideeffect "", "~{memory}"() #2
  %.not = icmp eq i64 %value_phi2, %1

In [53]:
@code_native debuginfo=:none k(Int64, 10)

	.text
	testq	%rdi, %rdi
	jle	L38
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rax
	movq	%rax, %rcx
	nopw	%cs:(%rax,%rax)
L32:
	decq	%rcx
	jne	L32
	retq
L38:
	xorl	%eax, %eax
	retq
	nopl	(%rax)


In [49]:
function m(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        escape(acc)
    end
    return acc
end

m (generic function with 1 method)

In [54]:
@code_llvm debuginfo=:none m(Int64, 30)

define i64 @julia_m_4634(i64 signext %0) #0 {
top:
  %.inv = icmp sgt i64 %0, 0
  %1 = select i1 %.inv, i64 %0, i64 0
  br i1 %.inv, label %L12, label %L29

L12:                                              ; preds = %L12, %top
  %value_phi2 = phi i64 [ %3, %L12 ], [ 1, %top ]
  %value_phi3 = phi i64 [ %2, %L12 ], [ 0, %top ]
  %2 = add nuw nsw

In [36]:
result2 = @benchmark m($Int64, $N)

BenchmarkTools.Trial: 199 samples with 1 evaluation.
 Range (min … max):  24.708 ms … 46.069 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.011 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.232 ms ±  1.643 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [37]:
@benchmark m($Int64, $(N*10))

BenchmarkTools.Trial: 20 samples with 1 evaluation.
 Range (min … max):  248.565 ms … 261.804 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     249.763 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   251.196 ms ±   3.736 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [38]:
t = time(minimum(result2)) * u"ns" # in ns
pFreq = N/t |> u"GHz"

4.047241671242073 GHz

Phew, this makes sense: 4 GHz is much closer to the actual clock frequency of my processor.
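The two throughput figures can be reproduced with plain arithmetic, without Unitful. The sketch below hard-codes the minimum times reported in the benchmark outputs of this notebook (as displayed, so rounded); the variable names are illustrative.

```julia
N = 100_000_000            # loop iterations per call

t_f = 1.249e-9             # seconds, minimum time reported for f(N)
t_m = 24.708e-3            # seconds, minimum time reported for m(Int64, N)

rate_f = N / t_f           # apparent additions per second for f
rate_m = N / t_m           # additions per second for m

println(rate_f / 1e15, " PHz")   # about 80 PHz: not physically plausible
println(rate_m / 1e9, " GHz")    # about 4 GHz: in the range of a CPU clock
```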

Note: Benchmarking is hard; think carefully about *what* you are trying to benchmark.

- If we were just interested in how fast `f(N)` was, we would have been fine with our first measurement
- But we were interested in the speed of addition as a proxy for performance
- Integer math on a computer is associative, floating-point math is not.
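The last point is what lets LLVM collapse the integer loop into a closed form while keeping the `Float64` loop intact unless reordering is explicitly allowed. A small illustration using only Base:

```julia
# Floating-point addition: regrouping changes the rounded result
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
println(a == b)        # false
println(a, "  ", b)    # 0.6000000000000001  0.6

# Int64 addition wraps on overflow, so regrouping never changes the result
x = typemax(Int64)
println((x + 1) - 1 == x + (1 - 1))   # true
```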

## Coming back to `h`
    
```julia
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```

In [55]:
@benchmark h($N)

BenchmarkTools.Trial: 68 samples with 1 evaluation.
 Range (min … max):  74.130 ms … 74.243 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     74.177 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   74.179 ms ± 28.005 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [56]:
function l(N)
    acc = 0.0
    @simd for i in 1:N
        acc += 1.0
    end
    acc
end

l (generic function with 1 method)

In [57]:
@benchmark l($(N))

BenchmarkTools.Trial: 1069 samples with 1 evaluation.
 Range (min … max):  4.631 ms …  5.352 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.642 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.674 ms ± 64.570 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

# Performance annotations in Julia

- https://docs.julialang.org/en/v1/manual/performance-tips/
- Julia does bounds checking by default: `ones(10)[11]` is an error
- `@inbounds` turns off bounds checking locally
- `@fastmath` turns off strict IEEE 754 semantics locally -- be very careful, this might not do what you want
- `@simd` and `@simd ivdep` give stronger guarantees to encourage LLVM to use SIMD operations
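A minimal sketch of how `@inbounds` and `@simd` are commonly combined on an indexed loop. The helper name `sum_inbounds` is made up for illustration. Skipping the bounds checks is only safe here because `eachindex(v)` yields valid indices by construction.

```julia
# Hypothetical helper name; sums a vector with bounds checks disabled
function sum_inbounds(v::Vector{Float64})
    acc = 0.0
    # `eachindex(v)` only yields valid indices, so skipping the checks is safe
    @inbounds @simd for i in eachindex(v)
        acc += v[i]
    end
    return acc
end

v = ones(1_000)
println(sum_inbounds(v))   # 1000.0
```

With `@simd` the additions may be reordered, so for general floating-point data the result can differ in the last bits from a strictly sequential sum; for a vector of ones the sum is exact either way.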

In [39]:
?@simd

```
@simd
```

Annotate a `for` loop to allow the compiler to take extra liberties to allow loop re-ordering

!!! warning
    This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the `@simd` macro may cause unexpected results.


The object iterated over in a `@simd for` loop should be a one-dimensional range. By using `@simd`, you are asserting several properties of the loop:

  * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
  * Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In many cases, Julia is able to automatically vectorize inner for loops without the use of `@simd`. Using `@simd` gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:

  * The loop must be an innermost loop
  * The loop body must be straight-line code. Therefore, [`@inbounds`](@ref) is currently needed for all array accesses. The compiler can sometimes turn short `&&`, `||`, and `?:` expressions into straight-line code if it is safe to evaluate all operands unconditionally. Consider using the [`ifelse`](@ref) function instead of `?:` in the loop if it is safe to do so.
  * Accesses must have a stride pattern and cannot be "gathers" (random-index reads) or "scatters" (random-index writes).
  * The stride should be unit stride.

!!! note
    The `@simd` does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use `@simd ivdep for ... end` to also assert that:


  * There exists no loop-carried memory dependencies
  * No iteration ever waits on a previous iteration to make forward progress.


In [59]:
@code_llvm debuginfo=:none l(10)

define double @julia_l_4677(i64 signext %0) #0 {
top:
  %.inv = icmp sgt i64 %0, 0
  %1 = select i1 %.inv, i64 %0, i64 0
  br i1 %.inv, label %L14.preheader, label %L38

L14.preheader:                                    ; preds = %top
  %min.iters.check = icmp ult i64 %1, 16
  br i1 %min.iters.check, label %L14, label %vector.ph

vector.ph:                                        ; preds = %L14.preheader
  %n.vec = and i64 %1

# Let's revisit our example from earlier!

A slightly more complicated function!

- What is wrong with `mysum3(ones(10_000))`?

In [61]:
function mysum3(data::Vector{T}) where T<:Number
  acc = zero(T)
  @simd for x in data
      acc += x
  end
  return acc
end

mysum3 (generic function with 1 method)

In [62]:
@code_llvm mysum3(zeros(3))

;  @ In[61]:1 within `mysum3`
define double @julia_mysum3_4706({}* nonnull align 16 dereferenceable(40) %0) #0 {
top:
;  @ In[61]:3 within `mysum3`
; ┌ @ simdloop.jl:71 within `macro expansion`
; │┌ @ simdloop.jl:51 within `simd_inner_length`
; ││┌ @ array.jl:215 within `length`
     %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }*
     %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16

    %exitcond.not = icmp eq i64 %23, %3
; │└
   br i1 %exitcond.not, label %L18, label %L10

L18:                                              ; preds = %L10, %middle.block, %top
   %value_phi2 = phi double [ 0.000000e+00, %top ], [ %19, %middle.block ], [ %22, %L10 ]
; └
;  @ In[61]:6 within `mysum3`
  ret double %value_phi2
}
