# Understandable performance
*Going fast, nowhere*

In [1]:
using Pkg
Pkg.activate("../envs/lecture2-2")
Pkg.instantiate()

Activating environment at `~/projects/julia-performance/envs/lecture2-2/Project.toml`
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`

## A note on benchmarking
*Premature optimization is the root of all evil* & *If you don't measure you won't improve*

### Tools
1. BenchmarkTools.jl https://github.com/JuliaCI/BenchmarkTools.jl
2. Profiler https://docs.julialang.org/en/latest/manual/profile/
3. ProfileView.jl https://github.com/timholy/ProfileView.jl
4. VTune/perf/OProfile https://docs.julialang.org/en/latest/manual/profile/#External-Profiling-1
5. PProf https://github.com/vchuravy/PProf.jl

## BenchmarkTools.jl
Solid package that tries to eliminate common pitfalls in performance measurement.
- `@benchmark` macro that repeatedly evaluates your code to gather enough samples
- Caveat: You probably want to interpolate your input data with `$`, otherwise the benchmark also measures the access to a non-constant global variable
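
A minimal sketch of the interpolation caveat, assuming BenchmarkTools.jl is installed. The names `t_global` and `t_interp` are made up for this illustration, and the actual numbers depend on your machine:

```julia
using BenchmarkTools

data = rand(2^10)   # a non-constant global

# Without `$`: `data` is looked up as an untyped global on every evaluation,
# so the measurement includes that lookup and the dynamic dispatch it causes.
t_global = @belapsed sum(data)

# With `$`: the value is spliced into the benchmark expression,
# so only the call itself is timed.
t_interp = @belapsed sum($data)

println("global: ", t_global, " s, interpolated: ", t_interp, " s")
```

`@belapsed` returns the minimum time in seconds; `@benchmark` accepts the same `$` syntax.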

In [2]:
data = rand(2^10);

In [3]:
function sum(X::Vector{Float64})
    acc = 0::Int64
    for x::Float64 in X
        (acc += x)::Float64
    end
    acc::Union{Int64, Float64}
end

sum (generic function with 1 method)

In [4]:
using BenchmarkTools
@benchmark sum($data)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.372 μs (0.00% GC)
  median time:      1.459 μs (0.00% GC)
  mean time:        1.466 μs (0.00% GC)
  maximum time:     3.489 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

![Compiler](../imgs/compiler.png)

![Compiler Stages](../imgs/compiler-stages.png)

## Figuring out what is happening
The stages of the compiler
- `@code_lowered`
- `@code_typed` & `@code_warntype`
- `@code_llvm`
- `@code_native`

To find out where a function is defined, use `@which` and `@edit`.
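
A small sketch of these macros on a toy function. `double` is a hypothetical example, not part of the lecture code. In a script the macros need `InteractiveUtils`; the REPL and notebooks load it automatically:

```julia
using InteractiveUtils  # provides the @code_* macros outside the REPL/notebook

double(x) = 2x   # hypothetical toy function, only for illustration

lowered = @code_lowered double(3)    # lowered form, before type inference
typed   = @code_typed double(3)      # a Pair: typed CodeInfo => inferred return type
@code_llvm double(3)                 # prints the LLVM IR
@code_native double(3)               # prints the machine code for this CPU
m = @which double(3)                 # the Method that would be called

println(last(typed))   # inferred return type
println(m.name)        # name of the method that was found
```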

In [5]:
##########################
# Low-level benchmarking #
##########################
using LLVM
using LLVM.Interop

"""
    clobber()

Force the compiler to flush pending writes to global memory.
Acts as an effective read/write barrier.
"""
@inline clobber() = @asmcall("", "~{memory}", true)

"""
    escape(val)

The `escape` function can be used to prevent a value or
expression from being optimized away by the compiler. This function is
intended to add little to no overhead.
See: https://youtu.be/nXaxk27zwlk?t=2441
"""
@inline escape(val::T) where T = @asmcall("", "X,~{memory}", true, Nothing, Tuple{T}, val)

escape

# A simple example: counting

In [6]:
function f(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

f (generic function with 1 method)

In [7]:
N = 100_000_000
result = @benchmark f($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.692 ns (0.00% GC)
  median time:      1.701 ns (0.00% GC)
  mean time:        1.746 ns (0.00% GC)
  maximum time:     11.439 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

In [8]:
using Unitful

In [9]:
t = time(minimum(result)) * u"ns" # in ns
pFreq = N/t |> u"PHz"
t, pFreq

(1.692 ns, 59.1016548463357 PHz)
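
As a cross-check, the same conversion with plain arithmetic. This is a sketch that hard-codes the minimum time from the output above; `t_ns`, `freq_hz` and `freq_phz` are names chosen for illustration:

```julia
N = 100_000_000      # number of loop iterations
t_ns = 1.692         # minimum time reported by @benchmark, in nanoseconds

freq_hz  = N / (t_ns * 1e-9)   # iterations per second
freq_phz = freq_hz / 1e15      # 1 PHz = 10^15 Hz

println(freq_phz)   # about 59.1, matching the Unitful result
```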

So we are doing 100 million additions in 1.7 ns.
So our processor is operating at roughly 59 PHz...

We wish...

What is going on?

In [10]:
@benchmark f($(10*N))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.692 ns (0.00% GC)
  median time:      1.701 ns (0.00% GC)
  mean time:        1.713 ns (0.00% GC)
  maximum time:     12.478 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

In [11]:
@code_lowered f(N)

CodeInfo(
1 ─       acc = 0
│   %2  = 1:N
│         @_4 = Base.iterate(%2)
│   %4  = @_4 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_4
│         i = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│         acc = acc + 1
│         @_4 = Base.iterate(%2, %9)
│   %12 = @_4 === nothing
│   %13 = Base.not_int(%12)
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
)

In [12]:
@code_typed optimize=false f(N)

CodeInfo(
1 ─       (acc = 0)::Core.Compiler.Const(0, false)
│   %2  = (1:N)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_4 = Base.iterate(%2))::Union{Nothing, Tuple{Int64,Int64}}
│   %4  = (@_4 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))::Int64
│   %9  = Core.getfield(%7, 2)::Int64
│         (acc = acc + 1)::Int64
│         (@_4 = Base.iterate(%2, %9))::Union{Nothing, Tuple{Int64,Int64}}
│   %12 = (@_4 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return acc
) => In

In [13]:
@code_typed optimize=true f(N)

CodeInfo(
1 ── %1  = Base.sle_int(1, N)::Bool
│    %2  = Base.ifelse(%1, N, 0)::Int64
│    %3  = Base.slt_int(%2, 1)::Bool
└───       goto #3 if not %3
2 ──       goto #4
3 ──       goto #4
4 ┄─ %7  = φ (#2 => true, #3 => false)::Bool
│    %8  = φ (#3 => 1)::Int64
│    %9  = Base.not_int(%7)::Bool
└───       goto #10 if not %9
5 ┄─ %11 = φ (#4 => 0, #9 => %13)::Int64
│    %12 = φ (#4 => %8, #9 => %19)::Int64
│    %13 = Base.add_int(%11, 1)::Int64
│    %14 = (%12 === %2)::Bool
└───       goto #7 if not %14
6 ──       goto #8
7 ── %17 = Base.add_int(%12, 1)::Int64
└───       goto #8
8 ┄─ %19 = φ (#7 => %17)::Int64
│    %20 = φ (#6 => true, #7 => false)::Bool

In [14]:
@code_llvm optimize=false f(10)


;  @ In[6]:2 within `f'
define i64 @julia_f_17205(i64) {
top:
  %1 = call %jl_value_t*** @julia.ptls_states()
  %2 = bitcast %jl_value_t*** %1 to %jl_value_t addrspace(10)**
  %3 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %2, i64 4
  %4 = bitcast %jl_value_t addrspace(10)** %3 to i64**
  %5 = load i64*, i64** %4
;  @ In[6]:3 within `f'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:275 within `Type'
; ││┌ @ range.jl:280 within `unitrange_last'
; │││┌ @ operators.jl:341 within `>='
; ││││┌ @ int.jl:424 within `<='
       %6 = icmp sle i64 1, %0
; │││└└
     %7 = zext i1 %6 to i8
     %8 = trunc i8 %7 to i1
     %9 = xor i1 %8, true
     %10 = select i1 %9, i64 0, i64 %0
; └└└
; ┌ @ range.jl:591 within `iterate'
; │┌ @ range.jl:475 within `isempty'
; ││┌ @ operators.jl:294 within `>'
; │││┌ @ int.jl:49 within `<'
      %11 = icmp slt i64 %10, 1
; │└└└
   %12 = zext i1 %11 to i8
   %13 = trunc i8 %12 to i1
   %14 = xor i1 %13, true
   br i1 %14, lab

In [15]:
@code_llvm optimize=true f(10)


;  @ In[6]:2 within `f'
define i64 @julia_f_17206(i64) {
top:
;  @ In[6]:3 within `f'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:275 within `Type'
; ││┌ @ range.jl:280 within `unitrange_last'
; │││┌ @ operators.jl:341 within `>='
; ││││┌ @ int.jl:424 within `<='
       %1 = icmp sgt i64 %0, 0
; └└└└└
  %spec.select = select i1 %1, i64 %0, i64 0
;  @ In[6]:6 within `f'
  ret i64 %spec.select
}


In [16]:
@code_native f(10)

	.text
; ┌ @ In[6]:2 within `f'
	movq	%rdi, %rax
	sarq	$63, %rax
	andnq	%rdi, %rax, %rax
; │ @ In[6]:6 within `f'
	retq
	nopl	(%rax)
; └


# Conclusion

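LLVM realised that our loop

```julia
for i in 1:N
  acc += 1
end
```

just ends up computing $acc = 1 * N$ (or 0 when $N < 1$), so it replaced the loop with that closed form. That is why the runtime does not depend on `N`.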
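
A small sketch that checks this reading of the generated code. `closed_form` and `branchless` are hypothetical helpers written for this illustration; `branchless` mirrors the shift and and-not instructions in the `@code_native` output above:

```julia
# Same loop as `f` in the notebook.
function f(N)
    acc = 0
    for i in 1:N
        acc += 1
    end
    return acc
end

# Hypothetical helper: the closed form the optimizer arrives at.
closed_form(N) = max(N, 0)

# Hypothetical helper mirroring the native code: an arithmetic shift
# builds a sign mask, and the and-not zeroes out negative inputs.
branchless(N::Int64) = N & ~(N >> 63)

for n in Int64[-5, 0, 1, 10, 1_000]
    @assert f(n) == closed_form(n) == branchless(n)
end
println("loop, closed form and branchless version agree")
```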

# Exercise

What happens with:

```julia
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```

and

```julia
function g(N)
    acc = 0
    for i in 1:N
        acc += 1.0
    end
    acc
end
```
    

In [17]:
function h(N)
    acc = 0.0
    for i in 1:N
        acc += 1.0
    end
    acc
end

h (generic function with 1 method)

In [18]:
@code_native h(10)

	.text
; ┌ @ In[17]:2 within `h'
	vxorpd	%xmm0, %xmm0, %xmm0
; │ @ In[17]:3 within `h'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%rdi, %rdi
; │└└└└└
	jle	L42
	movabsq	$139788113587504, %rax  # imm = 0x7F22F4DAAD30
	vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
	nopw	(%rax,%rax)
; │ @ In[17]:4 within `h'
; │┌ @ float.jl:395 within `+'
L32:
	vaddsd	%xmm1, %xmm0, %xmm0
; │└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	addq	$-1, %rdi
; │└└
	jne	L32
; │ @ In[17]:6 within `h'
L42:
	retq
	nopl	(%rax,%rax)
; └


In [19]:
function g(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
    end
    acc
end

g (generic function with 1 method)

In [20]:
@code_warntype g(Int, 10)

Variables
  #self#::Core.Compiler.Const(g, false)
  #unused#::Core.Compiler.Const(Int64, false)
  N::Int64
  acc::Int64
  @_5::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Int64
1 ─       (acc = Main.zero($(Expr(:static_parameter, 1))))
│   %2  = (1:N)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%2))
│   %4  = (@_5 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = acc::Int64
│   %11 = Main.one($(Expr(:static_parameter, 1)))::Core.Compiler.Const(1, false)
│         (acc = %1

In [21]:
function k(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        clobber()
    end
    return acc
end

k (generic function with 1 method)

In [22]:
@code_native k(Float64, 10)

	.text
; ┌ @ In[21]:2 within `k'
	vxorpd	%xmm0, %xmm0, %xmm0
; │ @ In[21]:3 within `k'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L42
	movabsq	$139788113599624, %rax  # imm = 0x7F22F4DADC88
	vmovsd	(%rax), %xmm1           # xmm1 = mem[0],zero
	nopw	(%rax,%rax)
; │ @ In[21]:4 within `k'
; │┌ @ float.jl:395 within `+'
L32:
	vaddsd	%xmm1, %xmm0, %xmm0
; │└
; │ @ In[21]:5 within `k'
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ base.jl:52 within `=='
	addq	$-1, %rsi
; │└└
	jne	L32
; │ @ In[21]:7 within `k'
L42:
	retq
	nopl	(%rax,%rax)
; └


In [23]:
@code_native k(Int64, 10)

	.text
; ┌ @ In[21]:3 within `k'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ In[21]:2 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L26
	movq	%rsi, %rax
	nopl	(%rax,%rax)
; │ @ In[21]:5 within `k'
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ base.jl:52 within `=='
L16:
	addq	$-1, %rax
; │└└
	jne	L16
; │ @ In[21]:7 within `k'
	movq	%rsi, %rax
	retq
L26:
	xorl	%esi, %esi
; │ @ In[21]:7 within `k'
	movq	%rsi, %rax
	retq
; └


In [24]:
function m(::Type{T}, N) where T
    acc = zero(T)
    for i in 1:N
        acc += one(T)
        escape(acc)
    end
    return acc
end

m (generic function with 1 method)

In [25]:
@code_native m(Int64, 30)

	.text
; ┌ @ In[24]:3 within `m'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ In[24]:2 within `<='
	testq	%rsi, %rsi
; │└└└└└
	jle	L38
	movq	%rsi, %rax
	negq	%rax
	movl	$1, %ecx
; │ @ In[24]:5 within `m'
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ base.jl:52 within `=='
L16:
	leaq	(%rax,%rcx), %rdx
	addq	$1, %rdx
; ││└
; ││ @ range.jl:596 within `iterate'
; ││┌ @ int.jl:53 within `+'
	addq	$1, %rcx
; ││└
; ││ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	cmpq	$1, %rdx
; │└└
	jne	L16
; │ @ In[24]:7 within `m'
	movq	%rsi, %rax
	retq
L38:
	xorl	%esi, %esi
; │ @ In[24]:7 within `m'
	movq	%rsi, %rax
	retq
	nopl	(%rax)
; └


In [26]:
result2 = @benchmark m($Int64, $N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.495 ms (0.00% GC)
  median time:      32.678 ms (0.00% GC)
  mean time:        33.166 ms (0.00% GC)
  maximum time:     56.087 ms (0.00% GC)
  --------------
  samples:          151
  evals/sample:     1

In [27]:
@benchmark m($Int64, $(N*10))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     285.817 ms (0.00% GC)
  median time:      294.762 ms (0.00% GC)
  mean time:        297.591 ms (0.00% GC)
  maximum time:     331.349 ms (0.00% GC)
  --------------
  samples:          17
  evals/sample:     1

In [28]:
t = time(minimum(result2)) * u"ns" # in ns
pFreq = N/t |> u"GHz"

3.279203735878765 GHz

Sanity restored: 3.3 GHz is much closer to the frequency of my actual processor.

Note: Benchmarking is hard and requires careful evaluation of *what* you are trying to benchmark.

- If we were just interested in how fast `f(N)` was, we would have been fine with our first measurement.
- But we were interested in the speed of addition as a proxy for performance.
- Integer math on a computer is associative, floating-point math is not. This is why LLVM may replace the integer loop with a closed form, but has to execute the floating-point additions in `h` one by one.
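
A quick check of the associativity claim, as a minimal sketch. The variable names are only for illustration:

```julia
# Floating point: regrouping changes the rounding, and thus the result.
a, b, c = 0.1, 0.2, 0.3
left  = (a + b) + c
right = a + (b + c)
println(left == right)    # false

# Machine integers wrap around on overflow, but regrouping never changes the result.
i, j, k = typemax(Int), 1, -1
println((i + j) + k == i + (j + k))   # true
```

Because of this, the compiler is only allowed to reorder floating-point additions when you explicitly permit it.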

In [29]:
@benchmark h($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     118.483 ms (0.00% GC)
  median time:      121.931 ms (0.00% GC)
  mean time:        123.227 ms (0.00% GC)
  maximum time:     130.173 ms (0.00% GC)
  --------------
  samples:          41
  evals/sample:     1

In [30]:
function l(N)
    acc = 0.0
    @simd for i in 1:N
        acc += 1.0
    end
    acc
end

l (generic function with 1 method)

In [31]:
@benchmark l($N)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.377 ms (0.00% GC)
  median time:      6.773 ms (0.00% GC)
  mean time:        6.750 ms (0.00% GC)
  maximum time:     7.683 ms (0.00% GC)
  --------------
  samples:          741
  evals/sample:     1

# Performance annotations in Julia

- https://docs.julialang.org/en/v1/manual/performance-tips/
- Julia does bounds checking by default: `ones(10)[11]` is an error
- `@inbounds` turns off bounds checking locally
- `@fastmath` turns off strict IEEE 754 semantics locally. Be very careful, this might not do what you want
- `@simd` and `@simd ivdep` give stronger guarantees to encourage LLVM to use SIMD operations
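
A hedged sketch of how `@inbounds` and `@simd` are typically combined. `mydot` is a hypothetical helper, not part of the lecture code, and whether LLVM actually vectorizes the loop depends on your CPU and Julia version:

```julia
# Hypothetical helper, only for illustration.
function mydot(x::Vector{Float64}, y::Vector{Float64})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    acc = 0.0
    # The length check above is what makes skipping bounds checks safe here.
    @inbounds @simd for i in eachindex(x)
        acc += x[i] * y[i]
    end
    return acc
end

# Without `@inbounds`, an out-of-range access is caught:
caught = try
    ones(10)[11]
    false
catch err
    err isa BoundsError
end

println(mydot(ones(100), fill(2.0, 100)))  # 200.0
println(caught)                            # true
```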

In [32]:
?@simd

```
@simd
```

Annotate a `for` loop to allow the compiler to take extra liberties to allow loop re-ordering

!!! warning
    This feature is experimental and could change or disappear in future versions of Julia. Incorrect use of the `@simd` macro may cause unexpected results.


The object iterated over in a `@simd for` loop should be a one-dimensional range. By using `@simd`, you are asserting several properties of the loop:

  * It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
  * Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In many cases, Julia is able to automatically vectorize inner for loops without the use of `@simd`. Using `@simd` gives the compiler a little extra leeway to make it possible in more situations. In either case, your inner loop should have the following properties to allow vectorization:

  * The loop must be an innermost loop
  * The loop body must be straight-line code. Therefore, [`@inbounds`](@ref) is currently needed for all array accesses. The compiler can sometimes turn short `&&`, `||`, and `?:` expressions into straight-line code if it is safe to evaluate all operands unconditionally. Consider using the [`ifelse`](@ref) function instead of `?:` in the loop if it is safe to do so.
  * Accesses must have a stride pattern and cannot be "gathers" (random-index reads) or "scatters" (random-index writes).
  * The stride should be unit stride.

!!! note
    The `@simd` does not assert by default that the loop is completely free of loop-carried memory dependencies, which is an assumption that can easily be violated in generic code. If you are writing non-generic code, you can use `@simd ivdep for ... end` to also assert that:


  * There exists no loop-carried memory dependencies
  * No iteration ever waits on a previous iteration to make forward progress.
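
To illustrate the straight-line-code requirement from the docstring, here is a sketch that uses `ifelse` instead of a branch inside the loop. `sum_positive` is a hypothetical example, and it is an assumption that this helps vectorization on your machine, so check with `@code_llvm`:

```julia
# Hypothetical example: sum only the positive entries.
function sum_positive(xs::Vector{Float64})
    acc = 0.0
    @inbounds @simd for i in eachindex(xs)
        x = xs[i]
        # `ifelse` evaluates both arguments and selects one, so there is no branch.
        acc += ifelse(x > 0, x, 0.0)
    end
    return acc
end

println(sum_positive([1.0, -2.0, 3.0]))  # 4.0
```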


In [33]:
@code_llvm l(10)


;  @ In[30]:2 within `l'
define double @julia_l_17667(i64) {
top:
;  @ In[30]:3 within `l'
; ┌ @ simdloop.jl:69 within `macro expansion'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
        %1 = icmp sgt i64 %0, 0
; ││││└└
      %2 = select i1 %1, i64 %0, i64 0
; │└└└
; │ @ simdloop.jl:71 within `macro expansion'
; │┌ @ simdloop.jl:51 within `simd_inner_length'
; ││┌ @ range.jl:541 within `length'
; │││┌ @ checked.jl:222 within `checked_sub'
; ││││┌ @ checked.jl:194 within `sub_with_overflow'
       %3 = add nsw i64 %2, -1
; │││└└
; │││┌ @ checked.jl:165 within `checked_add'
; ││││┌ @ checked.jl:132 within `add_with_overflow'
       %4 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %3, i64 1)
       %5 = extractvalue { i64, i1 } %4, 1
; ││││└
; ││││ @ checked.jl:166 within `checked_add'
      br i1 %5, label %L16, label %L21

L16:          

# Let's revisit our example from earlier!

Slightly more complicated function!

- What is wrong with `mysum3(ones(10_000))`?

In [34]:
function mysum3(data::Vector{T}) where T<:Number
  acc = zero(T)
  @simd for x in data
      acc += x
  end
  return acc
end

mysum3 (generic function with 1 method)

In [35]:
@code_warntype mysum3(zeros(3))

Variables
  #self#::Core.Compiler.Const(mysum3, false)
  data::Array{Float64,1}
  acc::Float64
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  r#421::Array{Float64,1}
  i#422::Int64
  n#423::Int64
  i#424::Int64
  x::Float64

Body::Float64
1 ─       (acc = Main.zero($(Expr(:static_parameter, 1))))
│         (r#421 = data)
│   %3  = Base.simd_outer_range::Core.Compiler.Const(Base.SimdLoop.simd_outer_range, false)
│   %4  = (%3)(r#421)::Core.Compiler.Const(0:0, false)
│         (@_4 = Base.iterate(%4))
│   %6  = (@_4::Core.Compiler.Const((0, 0), false) === nothing)::Core.Compiler.Const(false, false)
│   %7  = Base.not_int(%6)::Core.Compiler.Const(true, false)
└──       goto #8 if not %7
2 ─ %9  = @_4::Core.Compiler.Const((0, 0), false)::Core.Compiler.C

# Task

- Write a fast and generic `sum` implementation.

## Using the profiler

1. `using Profile`
2. `@profile mysum()`
3. `Profile.clear()` -- reset the profile
4. `Profile.print()` simple display of profile data
5. Use ProfileView.jl or PProf.jl to analyse your data better
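
The steps above can be sketched as follows, using only the `Profile` standard library. `busy` is a hypothetical workload made up for this illustration, and the printed profile will differ from machine to machine:

```julia
using Profile

# Hypothetical workload, only here to give the profiler something to sample.
function busy(n)
    acc = 0.0
    for i in 1:n
        acc += sin(i)
    end
    return acc
end

busy(10)                       # warm up, so compilation is not profiled
Profile.clear()                # reset any previously collected samples
@profile busy(10_000_000)      # collect samples while the call runs
Profile.print(maxdepth = 12)   # simple text display of the profile data
```

Running the function once before `@profile` keeps compilation time out of the samples.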
