## A Few Gotchas and How to Handle Them

This notebook covers several tricky situations in Julia and the ways around them.

Original author: Christopher Rackauckas

### Why is Julia Fast?

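Type specialization + code compilation: Julia compiles a separate native version of a function for each combination of concrete argument types it is called with. Below, the same f compiles to integer instructions for f(1,2) and to floating point instructions for f(1.,2.).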

In [5]:
function f(a,b)
  return 2a+b
end

f (generic function with 1 method)

In [6]:
@code_native f(1,2)

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[5]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 2
	leaq	(%rsi,%rdi,2), %rax
	popq	%rbp
	retq
	nopw	(%rax,%rax)


In [7]:
@code_native f(1.,2.)

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[5]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 2
	addsd	%xmm0, %xmm0
	addsd	%xmm1, %xmm0
	popq	%rbp
	retq
	nop
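
The same specialization can be observed without reading assembly by looking at the types that come back. This is a minimal sketch written for Julia 1.x (an assumption; the cells above were run on v0.5), and `lincomb` is a hypothetical stand-in for the notebook's `f`:

```julia
# Hypothetical stand-in for the notebook's f(a,b).
lincomb(a, b) = 2a + b

int_result   = lincomb(1, 2)      # runs the version compiled for (Int, Int)
float_result = lincomb(1.0, 2.0)  # runs a separate version for (Float64, Float64)

println(typeof(int_result))    # integers in, integer out
println(typeof(float_result))  # floats in, float out
```

One generic definition is enough; the compiler produces the type-specific versions on first use.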


### Gotcha: REPL "Globals" Have Bad Performance

Globals in Julia have awful performance, and avoiding them is the first item in the Performance Tips. What newcomers often don't realize is that the REPL is the global scope. To see why this matters, recall that Julia has nested scopes: if you define a function inside another function, the inner function can see all of the variables of the outer function.

In [1]:
function test(x)
    y = x+2
    function test2()
        y+3
    end
    test2()
end

test (generic function with 1 method)

In test2, y is known because it is defined in test. This will all work to give something performant if y is type-stable since test2 could then assume that y is always an integer. But now look at what happens at the highest scope (and thus effectively the global scope):

In [2]:
a = 3
function badidea()
    a + 2
end
a = 3.0

3.0

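Because badidea takes no arguments, dispatch cannot specialize it on the type of a, and since a is a non-constant global, we can change its type at any time. The compiler therefore cannot assume anything about the type of a when it compiles badidea, so it cannot apply type-specific optimizations. The same thing happens with two globals: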

In [8]:
a=2.0; b=3.0
function linearcombo()
  return 2a+b
end

linearcombo (generic function with 1 method)

In [12]:
@code_warntype linearcombo()

Variables:
  #self#::#linearcombo

Body:
  begin 
      return ((2 * Main.a)::ANY + Main.b)::ANY
  end::ANY


In [13]:
@code_native linearcombo()

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[8]
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$56, %rsp
	movabsq	$4588825784, %rbx       ## imm = 0x11183ECB8
	movabsq	$jl_get_ptls_states_fast, %rax
	callq	*%rax
	movq	%rax, %r14
	leaq	-64(%rbp), %r13
	movq	$0, -80(%rbp)
	movq	$0, -48(%rbp)
	movq	$0, -56(%rbp)
	movq	$0, -64(%rbp)
	movq	$10, -96(%rbp)
	movq	(%r14), %rax
	movq	%rax, -88(%rbp)
	leaq	-96(%rbp), %rax
	movq	%rax, (%r14)
	movq	$0, -72(%rbp)
Source line: 3
	movq	266505840(%rbx), %rax
	movq	%rax, -48(%rbp)
	leaq	183491360(%rbx), %rax
	movq	%rax, -64(%rbp)
	leaq	172987432(%rbx), %rax
	movq	%rax, -56(%rbp)
	movabsq	$jl_apply_generic, %r15
	movl	$3, %esi
	movq	%r13, %rdi
	callq	*%r15
	movq	%rax, %r12
	movq	%r12, -80(%rbp)
	movabsq	$4753320288, %rax       ## imm = 0x11B51E960
	movq	(%rax), %rax
	testq	%rax, %rax
	jne	L222
	leaq	173102488(%rbx), %rdi
	movabsq	$jl_get_binding_or_error, %rax
	movq	%rbx, %rsi
	callq	*%rax

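However, Julia allows us to declare variables as *constant*. Declaring a global const is a commitment that its type will not change: assigning a new value of the same type still works (Julia only warns about the redefinition), but assigning a value of a different type is an error.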

In [3]:
const aconst = 3
function notasbadidea()
    aconst + 2
end
aconst = 4 # Works
aconst = 3.0 # Fails



LoadError: LoadError: invalid redefinition of constant aconst
while loading In[3], in expression starting on line 6

The best way to avoid the problem entirely is to define variables inside functions.

In [4]:
function timetest()
    a = 3.0
    @time for i = 1:4
        a += i
    end
end
timetest() # First time compiles
timetest()

  0.000000 seconds
  0.000000 seconds


This is a very easy trap to fall into: don't benchmark or time things in the REPL's global scope. Always wrap the code in a function or declare the variables as const. There is a developer thread about making global performance less awful, but given what this notebook has shown, you can already see that it will never be "not awful", only "less awful".
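
The three options discussed above (a plain global, a const global, and passing the value as an argument) can be compared by asking the compiler what it was able to infer. This is a minimal sketch for Julia 1.x; all names are hypothetical, and `Base.return_types` is an unexported introspection helper rather than part of the notebook:

```julia
# Hypothetical names; written for Julia 1.x.
untyped_global = 3.0        # non-const global: its type may change later
const fixed_global = 3.0    # const global: its type is fixed

read_untyped() = untyped_global + 2   # must handle any type at run time
read_const()   = fixed_global + 2     # compiler knows it is a Float64
add_two(a)     = a + 2                # no global: specialized per argument type

# Base.return_types reports the inferred return type(s) for a signature.
println(Base.return_types(read_untyped, Tuple{}))
println(Base.return_types(read_const, Tuple{}))
println(Base.return_types(add_two, Tuple{Float64}))
```

Only the version reading the non-const global leaves the compiler without a concrete type to work with.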

### Gotcha: Type-Instabilities

What happens when your types can change?

If you guessed "well, you can't really specialize the compiled code in that case either", then you are correct. This kind of problem is known as a type-instability. These can show up in many different ways, but one common example is initializing a value in a way that is convenient but does not give it the type it should ultimately have. For example, let's look at:

In [6]:
function g()
    x = 1
    for i = 1:10
        x = x/2
    end
    return x
end



g (generic function with 1 method)

Notice that "1/2" is a floating point number in Julia. Therefore, if we start with "x=1", x changes type from an integer to a floating point number, and the function has to compile the inner loop as though x can be either type. If we instead had the function:

In [7]:
function h()
    x = 1.0
    for i = 1:10
        x = x/2
    end
    return x
end

h (generic function with 1 method)

then the whole function can compile optimally, knowing that x will stay a floating point number (this ability of the compiler to work out types is known as type inference). We can check the compiled code for g and for h, for example with @code_native, to see the difference: far fewer computational steps are required to compute the same value when the type is stable!
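
To reproduce that comparison on your own machine, the two functions can be inspected side by side. This is a sketch for Julia 1.x, where the inspection macros live in the `InteractiveUtils` standard library (loaded automatically in the REPL); the printed code differs between Julia versions and CPUs, so no output is shown here:

```julia
using InteractiveUtils   # provides @code_llvm and @code_native outside the REPL

function g()
    x = 1          # starts as an integer
    for i = 1:10
        x = x/2    # becomes a floating point number
    end
    return x
end

function h()
    x = 1.0        # floating point from the start
    for i = 1:10
        x = x/2
    end
    return x
end

@code_llvm g()   # compare the length and the branching of this listing...
@code_llvm h()   # ...with this one
```

Both functions return the same value; only the amount of generated code differs.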

### How to Find and Deal with Type-Instabilities

At this point you might ask, "well, why not just use C so you don't have to try and find these instabilities?" The answer is:

1. They are easy to find
2. They can be useful
3. You can handle necessary instabilities with function barriers

### How to Find Type-Instabilities

Julia gives you the macro @code_warntype to show you where type instabilities are. For example, if we use this on the "g" function we created:

In [8]:
@code_warntype g()

Variables:
  #self#::#g
  x::ANY
  #temp#@_3::Int64
  i::Int64
  #temp#@_5::LambdaInfo
  #temp#@_6::Float64

Body:
  begin 
      x::ANY = 1 # line 3:
      SSAValue(2) = (Base.select_value)((Base.sle_int)(1,10)::Bool,10,(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64
      #temp#@_3::Int64 = 1
      5: 
      unless (Base.box)(Base.Bool,(Base.not_int)((#temp#@_3::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(2),1)))::Bool)) goto 30
      SSAValue(3) = #temp#@_3::Int64
      SSAValue(4) = (Base.box)(Int64,(Base.add_int)(#temp#@_3::Int64,1))
      i::Int64 = SSAValue(3)
      #temp#@_3::Int64 = SSAValue(4) # line 4:
      unless (Core.isa)(x::UNION{FLOAT64,INT64},Float64)::ANY goto 15
      #temp#@_5::LambdaInfo = LambdaInfo for /(::Float64, ::Int64)
      goto 24
      15: 
      unless (Core.isa)(x::UNION{FLOAT64,INT64},Int64)::ANY goto 19
      #temp#@_5::LambdaInfo = LambdaInfo for /(::Int64, ::Int64)
      goto 24
      19: 
      goto 21
      21: 
      #temp#@_6::Float64 = (x

Notice that it tells us at the top that the type of x is "ANY". @code_warntype capitalizes any type that is not inferred as a "strict type", i.e. an abstract type whose values need to be boxed and checked at each step. We also see x annotated as "UNION{FLOAT64,INT64}", which is another non-strict type. This tells us that the type of x changes, which is what causes the difficulty. If we instead look at the @code_warntype for h, we get all strict types:

In [9]:
@code_warntype h()

Variables:
  #self#::#h
  x::Float64
  #temp#::Int64
  i::Int64

Body:
  begin 
      x::Float64 = 1.0 # line 3:
      SSAValue(2) = (Base.select_value)((Base.sle_int)(1,10)::Bool,10,(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64
      #temp#::Int64 = 1
      5: 
      unless (Base.box)(Base.Bool,(Base.not_int)((#temp#::Int64 === (Base.box)(Int64,(Base.add_int)(SSAValue(2),1)))::Bool)) goto 15
      SSAValue(3) = #temp#::Int64
      SSAValue(4) = (Base.box)(Int64,(Base.add_int)(#temp#::Int64,1))
      i::Int64 = SSAValue(3)
      #temp#::Int64 = SSAValue(4) # line 4:
      x::Float64 = (Base.box)(Base.Float64,(Base.div_float)(x::Float64,(Base.box)(Float64,(Base.sitofp)(Float64,2))))
      13: 
      goto 5
      15:  # line 6:
      return x::Float64
  end::Float64


This indicates that the function is type-stable and will compile to code that is essentially as fast as optimal C. Thus type-instabilities are not hard to find. What's harder is finding the right design.
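
Besides reading @code_warntype output by eye, the check can be automated. The sketch below assumes Julia 1.x and uses `@inferred` from the `Test` standard library, which throws an error when the actual return type differs from the inferred one. The function names are hypothetical, and the loop length is made an argument so that the compiler cannot know how often the loop runs:

```julia
using Test

function halve_unstable(n)
    x = 1             # starts as an Int
    for i = 1:n
        x = x / 2     # becomes a Float64 once the loop body runs
    end
    return x
end

function halve_stable(n)
    x = 1.0           # Float64 from the start
    for i = 1:n
        x = x / 2
    end
    return x
end

@inferred halve_stable(10)   # passes silently: the return type is inferred exactly

# For the unstable version @inferred throws, which we record instead of crashing.
unstable_detected = try
    @inferred halve_unstable(10)
    false
catch err
    err isa ErrorException
end
println(unstable_detected)
```

A check like this fits naturally into a test suite, so that a type-instability introduced later is caught early.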

### Why Allow Type-Instabilities?

This is an age-old question which has led to dynamically-typed languages dominating the scripting language playing field. The idea is that, in many cases, you want to make a tradeoff between performance and robustness. For example, you may want to read a table from a webpage in which integers and floating point numbers are mixed together. In Julia, you can write your function such that if they were all integers, it will compile well, and if they were all floating point numbers, it will also compile well. And if they're mixed? It will still work. That's the flexibility/convenience we know and love from a language like Python/R. But Julia will explicitly tell you (via @code_warntype) when you are making this performance tradeoff.

### How to Handle Type-Instabilities

There are a few ways to handle type-instabilities. First of all, if you like languages such as C/Fortran, where types are declared and can't change (thus ensuring type-stability), you can do that in Julia. You can declare the type of a variable inside a function with a declaration such as `local a::Int64 = 5`.

This makes "a" a 64-bit integer. If later code assigns a value of another type to it, Julia tries to convert the value to Int64, and because the conversion does not round automatically, a non-integer value will throw an error. Sprinkle these declarations around your code and you will get type stability the C/Fortran way.

A less heavy-handed way to handle this is with type-assertions. This is where you put the same syntax on the other side of the equals sign, for example `a = (b/c)::Float64`.

This says "calculate b/c, and make sure that the output is a Float64; if it's not, throw an error". Unlike a declaration, an assertion does not attempt a conversion. Putting these around your code will help you make sure you know which types are involved.
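
The difference between a declaration and an assertion is easiest to see side by side. This is a minimal sketch for Julia 1.x with hypothetical function names; the values are chosen only for illustration:

```julia
function declared_exact()
    a::Int64 = 5   # declaration: every assignment to a is converted to Int64
    a = 6.0        # 6.0 is a whole number, so it converts to the integer 6
    return a
end

function declared_inexact()
    a::Int64 = 5
    a = 6.5        # cannot be converted without rounding: throws InexactError
    return a
end

# Assertion: checks the type of the result, never converts it.
asserted_ratio(b, c) = (b / c)::Float64

println(declared_exact())
println(asserted_ratio(1, 2))   # Int / Int is a Float64, so the assertion holds
```

Calling `asserted_ratio(1//2, 3)` would throw a TypeError, because dividing a Rational by an integer yields a Rational rather than a Float64.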

However, there are cases where type instabilities are necessary. For example, let's say you want your code to be robust, but the user gives you something crazy like:

In [14]:
arr = Vector{Union{Int64,Float64}}(4)
arr[1]=4
arr[2]=2.0
arr[3]=3.2
arr[4]=1

1

In [1]:
arr = zeros(Union{Int64, Float64}, 10^5)
arr[rand(1:10^5, 5000)] = 1.0

1.0

which is a 10^5-element array of both integers and floating point numbers. The actual element type for the array is "Union{Int64,Float64}" which we saw before was a non-strict type which can lead to issues. The compiler only knows that each value can be either an integer or a floating point number, but not which element is which type.

In [2]:
function foo{T,N}(array::Array{T,N})
  for i in eachindex(array)
    val = array[i]
    x = [val+i+j for i = 1:10,j = 1:10]
    svdvals(x)
  end
end

foo (generic function with 1 method)

In [3]:
function inner_foo{T<:Number}(val::T)
  x = [val+i+j for i = 1:10,j = 1:10]
  svdvals(x)
end
 
function foo2{T,N}(array::Array{T,N})
  for i in eachindex(array)
    inner_foo(array[i])
  end
end

foo2 (generic function with 1 method)

In [11]:
using BenchmarkTools



In [12]:
@benchmark foo(arr)

BenchmarkTools.Trial: 
  memory estimate:  1.26 gb
  allocs estimate:  3471200
  --------------
  minimum time:     2.472 s (4.39% GC)
  median time:      2.660 s (4.97% GC)
  mean time:        2.660 s (4.97% GC)
  maximum time:     2.848 s (5.48% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [13]:
@benchmark foo2(arr)

BenchmarkTools.Trial: 
  memory estimate:  1.22 gb
  allocs estimate:  1900000
  --------------
  minimum time:     1.643 s (6.68% GC)
  median time:      1.648 s (6.79% GC)
  mean time:        1.649 s (6.80% GC)
  maximum time:     1.658 s (6.78% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
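
The only difference between foo and foo2 is that foo2 moves the per-element work into inner_foo. That call is the function barrier mentioned earlier: the loop itself still cannot know the element type, but each call dispatches on the value's runtime type and then runs code compiled for that concrete type. Below is a reduced sketch of the same pattern in Julia 1.x syntax (`where` replaces the v0.5 `{T}` form); the names and the summation kernel are hypothetical simplifications of the svdvals example:

```julia
# Kernel behind the barrier: compiled separately for Int64 and for Float64.
kernel(val::T) where {T<:Number} = sum(val + i + j for i in 1:10, j in 1:10)

function total_with_barrier(values::Vector)
    total = 0.0
    for v in values
        total += kernel(v)   # dynamic dispatch happens once per element, here
    end
    return total
end

# Mixed array whose element type is a Union, as in the notebook.
mixed = Union{Int64,Float64}[4, 2.0, 3.2, 1]
println(eltype(mixed))
println(total_with_barrier(mixed))
```

The design choice is to keep the unavoidable uncertainty in a thin outer loop and put everything expensive behind the call.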

### Gotcha: How Expressions Break Up

In Julia, an expression that is not yet complete at the end of a line continues onto the next line. For this reason line-continuation operators are not necessary: Julia simply reads until the expression is finished.

Easy rule, right? Just make sure you remember how expressions finish. For example:

In [11]:
a = 2 + 3 + 4 + 5 + 6 + 7
   +8 + 9 + 10+ 11+ 12+ 13

63

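looks like it will assign 90 to a, but it does not: the first line is already a complete expression, so a is 27, and the second line is evaluated as a separate expression whose value, 63, is what the notebook displays.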

In [12]:
a = 2 + 3 + 4 + 5 + 6 + 7 +
    8 + 9 + 10+ 11+ 12+ 13

90

This will make a=90 as we wanted. This might trip you up the first time, but then you'll get used to it.
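
Both behaviors, plus a third way to continue an expression, can be checked in a few lines. This sketch assumes Julia 1.x; wrapping the expression in parentheses is an addition not shown in the notebook, and it works because a newline inside parentheses does not end the expression:

```julia
# The first line is complete on its own, so a is only the first sum.
a = 2 + 3 + 4 + 5 + 6 + 7
    +8 + 9 + 10 + 11 + 12 + 13   # separate expression; its value is discarded

# A trailing operator leaves the expression unfinished, so it continues.
b = 2 + 3 + 4 + 5 + 6 + 7 +
    8 + 9 + 10 + 11 + 12 + 13

# Inside parentheses the line break is ignored as well.
c = (2 + 3 + 4 + 5 + 6 + 7
     + 8 + 9 + 10 + 11 + 12 + 13)

println((a, b, c))
```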

The more difficult issue deals with array definitions. For example:

In [13]:
x = rand(2,2)
a = [cos(2*pi.*x[:,1]).*cos(2*pi.*x[:,2])./(4*pi) -sin(2.*x[:,1]).*sin(2.*x[:,2])./(4)]
b = [cos(2*pi.*x[:,1]).*cos(2*pi.*x[:,2])./(4*pi) - sin(2.*x[:,1]).*sin(2.*x[:,2])./(4)]

1-element Array{Array{Float64,1},1}:
 [-0.187914,-0.259496]

At a glance you might think a and b are the same, but they are not! The first gives you a (2,2) matrix, while the second is a 1-element array holding a single vector of length 2. To see what the issue is, here's a simpler version:

In [14]:
a = [1 -2]
b = [1 - 2]

1-element Array{Int64,1}:
 -1

In the first case there are two numbers: "1" and "-2". In the second there is an expression: "1-2" (which is evaluated to give the array [-1]). This is because of the special syntax for array definitions. It's usually really lovely to write:

In [15]:
a = [1 2 3 -4
     2 -3 1 4]

2×4 Array{Int64,2}:
 1   2  3  -4
 2  -3  1   4

and get the 2x4 matrix that you'd expect. The whitespace sensitivity is the price of that convenience. Fortunately, this issue is also easy to avoid: instead of concatenating using a space (i.e. in a whitespace-sensitive manner), use the "hcat" function:

In [16]:
a = hcat(cos(2*pi.*x[:,1]).*cos(2*pi.*x[:,2])./(4*pi),-sin(2.*x[:,1]).*sin(2.*x[:,2])./(4))

2×2 Array{Float64,2}:
  0.0431695  -0.231084
 -0.0173947  -0.242101

Problem Solved!
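
The simpler version of the problem and its fix can be verified directly. This is a small sketch that should behave the same on Julia 1.x; the variable names are arbitrary:

```julia
a = [1 -2]        # whitespace separates two elements: 1 and -2
b = [1 - 2]       # spaces on both sides of the minus: one expression, 1 - 2
c = hcat(1, -2)   # explicit horizontal concatenation, no whitespace rules

println(size(a))  # two columns
println(size(b))  # a single element
println(a == c)   # hcat reproduces the two-element row
```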

### Gotcha: Views, Copy, and Deepcopy

One way in which Julia gets good performance is by working with "views". An "Array" is actually a "view" to the contiguous block of memory which is used to store the values. The "value" of the array is its pointer to the memory location (and its type information). This gives interesting (and useful) behavior. For example, if we run the following code:

In [17]:
a = [3;4;5]
b = a
b[1] = 1

1

then at the end we will have that "a" is the array "[1;4;5]", i.e. changing "b" changes "a". The reason is that "b=a" sets the value of "b" to the value of "a". Since the value of an array is its pointer to the memory location, what "b" actually gets is not a new array; rather, it gets the pointer to the same memory location (which is why changing "b" changes "a").

This is very useful because it also allows you to keep the same array in many different forms. For example, we can have both a matrix and the vector form of the matrix using:

In [18]:
a = rand(2,2) # Makes a random 2x2 matrix
b = vec(a) # Makes a view to the 2x2 matrix which is a 1-dimensional array

4-element Array{Float64,1}:
 0.461209
 0.79796 
 0.765534
 0.489954

Now "b" is a vector, but changing "b" still changes "a", where "b" is indexed by reading down the columns. Notice that this whole time, no arrays have been copied, and therefore these operations have been extremely cheap (meaning there's no reason to avoid them in performance-sensitive code).
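
Both claims, that assignment shares the array and that vec shares the matrix's memory, can be checked with a few assignments. A minimal sketch, assuming Julia 1.x and plain `Array` values; the names are arbitrary:

```julia
a = [3, 4, 5]
b = a            # b is another name for the same array, not a copy
b[1] = 1         # so this write is visible through a as well
println(a)

m = zeros(2, 2)
v = vec(m)       # 4-element vector that shares m's memory
v[2] = 7.0       # column-major order: element 2 of v is m[2, 1]
println(m[2, 1])
```

The `===` operator tells you whether two names refer to the very same array: here `b === a` is true.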

Now some details. Notice that the syntax for slicing an array will create a copy when on the right-hand side. For example:

In [19]:
c = a[1:2,1]

2-element Array{Float64,1}:
 0.461209
 0.79796 

will create a new array, and point "c" to that new array (thus changing "c" won't change "a"). This can be necessary behavior. Note, however, that copying arrays is an expensive operation that should be avoided whenever possible. Thus we would instead create more complicated views using:

In [20]:
d = @view a[1:2,1]
e = view(a,1:2,1)

2-element SubArray{Float64,1,Array{Float64,2},Tuple{UnitRange{Int64},Int64},true}:
 0.461209
 0.79796 

Both "d" and "e" are the same thing, and changing either "d" or "e" will change "a", because neither copies the array: each is a new object (a SubArray) that behaves like a vector and points to the first column of "a". (Another function which creates views is "reshape", which lets you reshape an array.)
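
The contrast between a slice and a view is easy to confirm on a small matrix. This is a sketch for Julia 1.x with arbitrary names and values:

```julia
m = [1.0 3.0; 2.0 4.0]

c = m[1:2, 1]          # slicing on the right-hand side makes a copy
c[1] = 100.0           # only the copy changes; m[1, 1] stays 1.0

d = @view m[1:2, 1]    # macro form of a view onto the first column
e = view(m, 1:2, 1)    # function form, refers to the same column
d[2] = -2.0            # writes through to m[2, 1], and e sees it too

println(m)
println(e)
```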

If this syntax is on the left-hand side, then it's a view. For example:

In [21]:
a[1:2,1] = [1;2]

2-element Array{Int64,1}:
 1
 2

will change "a" because, on the left-hand side, "a[1:2,1]" behaves like "view(a,1:2,1)": the values are written into the same memory as "a".

What if we need to make copies? Then we can use the copy function:

In [22]:
b = copy(a)

2×2 Array{Float64,2}:
 1.0  0.765534
 2.0  0.489954

Now since "b" is a copy of "a" and not a view, changing "b" will not change "a". If we had already defined "b", there's a handy in-place copy "copy!(b,a)" which will essentially loop through and write the values of "a" to the locations of "b" (but this requires that "b" is already defined and is the right size).

But now let's make a slightly more complicated array. For example, let's make a "Vector{Vector}":

In [23]:
a = Vector{Vector{Float64}}(2)
a[1] = [1;2;3]
a[2] = [4;5;6]

3-element Array{Int64,1}:
 4
 5
 6

Each element of "a" is a vector. What happens when we copy a?

In [24]:
b = copy(a)
b[1][1] = 10

10

Notice that this will change a[1][1] to 10 as well! Why did this happen? What happened is we used "copy" to copy the values of "a". But the values of "a" were arrays, so we copied the pointers to memory locations over to "b", so "b" actually points to the same arrays. To fix this, we instead use "deepcopy":

In [25]:
b = deepcopy(a)

2-element Array{Array{Float64,1},1}:
 [10.0,2.0,3.0]
 [4.0,5.0,6.0] 

This recursively calls copy in such a manner that we avoid this issue. Again, the rules of Julia are very simple and there's no magic, but sometimes you need to pay closer attention.
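
The difference between the two kinds of copy shows up as soon as an inner array is modified. A minimal sketch for Julia 1.x, using an array literal instead of the v0.5 constructor; the names are arbitrary:

```julia
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # a vector of vectors

b = copy(a)        # new outer array, but it holds the same inner arrays
b[1][1] = 10.0     # therefore this also changes a[1][1]

c = deepcopy(a)    # new outer array and new inner arrays
c[1][1] = -1.0     # a is unaffected

println(a[1])
println(b[1] === a[1])   # same inner array object
println(c[1] === a[1])   # different inner array object
```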

In [1]:
using BenchmarkTools
const x = rand(1000, 1000, 10);
const y = zeros(1000, 1000);

In [2]:
function bench(a::Array{Float64,3}, b::Matrix{Float64})
    for i = 1:10
        b += a[:,:,i]
    end
end
@benchmark bench(x, y)

BenchmarkTools.Trial: 
  memory estimate:  152.59 mb
  allocs estimate:  60
  --------------
  minimum time:     70.534 ms (24.12% GC)
  median time:      79.008 ms (23.94% GC)
  mean time:        79.675 ms (23.54% GC)
  maximum time:     89.337 ms (25.42% GC)
  --------------
  samples:          63
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [4]:
function bench2(a::Array{Float64,3}, b::Matrix{Float64})
    for i = 1:10
        b += view(a, :, :, i)
    end
end
@benchmark bench2(x, y)



BenchmarkTools.Trial: 
  memory estimate:  76.30 mb
  allocs estimate:  40
  --------------
  minimum time:     43.617 ms (20.86% GC)
  median time:      47.300 ms (22.05% GC)
  mean time:        48.197 ms (22.50% GC)
  maximum time:     75.847 ms (20.19% GC)
  --------------
  samples:          104
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
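
Note that bench2 still allocates: `b += ...` builds a new array for every sum and rebinds the local name `b`, so the matrix that was passed in is never updated. An in-place variant avoids this by writing the sum back into `b` with a fused broadcast. The sketch below assumes Julia 1.x; `accumulate_slices!` is a hypothetical name and the sizes are shrunk for illustration:

```julia
# In-place accumulation: b is updated directly, no per-iteration result array.
function accumulate_slices!(b::Matrix{Float64}, a::Array{Float64,3})
    for i in axes(a, 3)
        b .+= view(a, :, :, i)   # fused broadcast: adds slice i into b in place
    end
    return b
end

a = ones(4, 4, 10)
b = zeros(4, 4)
accumulate_slices!(b, a)
println(b[1, 1])
```

After the call, every entry of `b` holds the sum of the ten slices, and the caller's array itself has changed.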

In [5]:
using BenchmarkTools
const c = rand(1000, 1000, 3);
const d = [-1 0 1; 1 0 1; 0 -1 0];

In [15]:
function stencil(c::Array{Float64,3}, d::Matrix{Int64})
    for k = 1:3
        for j = 2:size(c, 2)-1
            for i = 2:size(c, 1)-1
                c[j-1:j+1, i-1:i+1, k] = c[j-1:j+1, i-1:i+1, k] .* d
            end
        end
    end
    c
end
@benchmark stencil(c, d)



BenchmarkTools.Trial: 
  memory estimate:  2.45 gb
  allocs estimate:  59760240
  --------------
  minimum time:     4.478 s (5.02% GC)
  median time:      4.581 s (5.20% GC)
  mean time:        4.581 s (5.20% GC)
  maximum time:     4.684 s (5.37% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In [16]:
function stencilview(c::Array{Float64,3}, d::Matrix{Int64})
    for k = 1:3
        for j = 2:size(c, 2)-1
            for i = 2:size(c, 1)-1
                c[j-1:j+1, i-1:i+1, k] = view(c, j-1:j+1, i-1:i+1, k) .* d
            end
        end
    end
    c
end
@benchmark stencilview(c, d)



BenchmarkTools.Trial: 
  memory estimate:  1.96 gb
  allocs estimate:  50796204
  --------------
  minimum time:     4.021 s (5.06% GC)
  median time:      4.385 s (5.64% GC)
  mean time:        4.385 s (5.64% GC)
  maximum time:     4.749 s (6.13% GC)
  --------------
  samples:          2
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

### Gotcha: Temporary Allocations, Vectorization, and In-Place Functions

In MATLAB/Python/R, you're told to use vectorization. Vectorized code, however, creates "temporary allocations" (i.e. it makes intermediate arrays which aren't needed, and as noted before, array allocations are expensive and slow down your code!).

For this reason, you will want to fuse your vectorized operations and write them in-place in order to avoid allocations. What do I mean by in-place? An in-place function is one that updates a value instead of returning a value. If you're going to continually operate on an array, this will allow you to keep using the same array, instead of creating new arrays each iteration. For example, if you wrote:

In [29]:
function f()
    x = [1;5;6]
    for i = 1:10
        x = x + inner(x)
    end
    return x
end
function inner(x)
    return 2x
end

inner (generic function with 1 method)

then each time inner is called, it will create a new array to return "2x" in. Clearly we don't need to keep making new arrays. So instead we could have a cache array "y" which will hold the output like so:

In [30]:
function f_inplace()
    x = [1;5;6]
    y = Vector{Int64}(3)
    for i = 1:10
        inner!(y,x)
        for i in 1:3
            x[i] = x[i] + y[i]
        end
        copy!(y,x)
    end
    return x
end

function inner!(y,x)
    for i=1:3
        y[i] = 2*x[i]
    end
    nothing
end

inner! (generic function with 1 method)

Let's dig into what's happening here. "inner!(y,x)" doesn't return anything, but it changes "y". Since "y" is an array, the value of "y" is the pointer to the actual array, and since in the function those values were changed, "inner!(y,x)" will have "silently" changed the values of "y". Functions which do this are called in-place. They are usually denoted with a "!", and usually change the first argument (this is just by convention). So there is no array allocation when "inner!(y,x)" is called.

In the same way, "copy!(y,x)" is an in-place function which writes the values of "x" to "y", updating it. As you can see, this means that every operation only changes the values of the arrays. Only two arrays are ever created: the initial array for "x" and the initial array for "y". The first function created new arrays on every single iteration: one inside inner for "2x" and one for the sum "x + inner(x)". Over 10 iterations that is 20 temporary arrays on top of the initial "x", i.e. 21 arrays in total. Since array allocations are expensive, the second function will run faster than the first function.

It's nice that we can get fast code, but the syntax bloated a little when we had to write out the loops. That's where loop-fusion comes in. In Julia v0.5, you can now use the "." symbol to vectorize any function (also known as broadcasting, because it actually calls the "broadcast" function). While it's cool that "f.(x)" is the same thing as applying "f" to each value of "x", what's cooler is that the loops fuse. If you just applied "f" to "x" and made a new array, then "x=x+f.(x)" would allocate a temporary array for "f.(x)" and another for the sum. What we can do instead is write every operation in its dotted (broadcast) form:

`x .= x .+ f.(x)`
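
To see that the fused form computes the same result as the allocating form, the two can be run side by side. This is a minimal sketch for Julia 1.x, where dotted operators and `.=` fuse into a single loop; `double`, `grow_allocating` and `grow_fused!` are hypothetical names modelled on the f/inner example above:

```julia
double(x) = 2x   # scalar function, applied elementwise with the dot syntax

# Allocating version: new arrays are created on every iteration.
function grow_allocating()
    x = [1, 5, 6]
    for i = 1:10
        x = x + double.(x)
    end
    return x
end

# Fused, in-place version: one loop per iteration, written into x itself.
function grow_fused!(x)
    for i = 1:10
        x .= x .+ double.(x)
    end
    return x
end

println(grow_allocating())
println(grow_fused!([1, 5, 6]))
```

Each iteration triples every entry, so both functions return the starting values multiplied by 3^10.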

In [22]:
function stencilunroll(c::Array{Float64,3}, d::Matrix{Int64})
    for k = 1:3
        for j = 2:size(c, 2)-1
            for i = 2:size(c, 1)-1
                for l = j-1:j+1, xx = 1:3
                    for m = i-1:i+1, yy = 1:3
                        c[l, m, k] = c[l, m, k] * d[xx,yy]
                    end
                end
            end
        end
    end
    c
end
@benchmark stencilunroll(c, d)



BenchmarkTools.Trial: 
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     351.918 ms (0.00% GC)
  median time:      372.079 ms (0.00% GC)
  mean time:        432.163 ms (0.00% GC)
  maximum time:     647.453 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

### Gotcha: Not Building the System Image for your Hardware

I was following all of these rules thinking I was a Julia champ, and then one day I realized that not every compiler optimization was actually happening. What was going on?

It turns out that the pre-built binaries that you get via the downloads on the Julia site are toned down in their capabilities in order to be usable on a wider variety of machines. This includes the binaries you get on Linux when you do "apt-get install" or "yum install". Thus, unless you built Julia from source, your Julia is likely not as fast as it could be.

To customize the system image for your system's specific architecture, just run the following code in Julia:

```
include(joinpath(dirname(JULIA_HOME),"share","julia","build_sysimg.jl"));
build_sysimg(force=true)
```

If you're on Windows, you may need to run this code first:

`Pkg.add("WinRPM"); WinRPM.install("gcc")`

And on any system, you may need administrator privileges. This will take a little while, but when it's done, your install will be tuned to your system, giving you all of the optimizations available.