# Exercise: SIMD Data Dependency

Consider the following loop involving four vectors `a`, `b`, `c`, and `d`:

In [1]:
const LOOP_ITERATIONS = 8192
const N = LOOP_ITERATIONS + 2

function loop_naive!(a, b, c, d)
    @inbounds for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
end

a = rand(Float32, N)
b = rand(Float32, N)
c = rand(Float32, N)
d = rand(Float32, N)

loop_naive!(a,b,c,d)

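This loop is hard to auto-vectorize because it has a **data dependency**: iteration `i` reads `b[i]`, which iteration `i-2` has already written as `b[i+2]`. Elements of the vector `b` are therefore both read and written within the same loop, and the order of the iterations matters.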
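To make the dependency concrete, here is a minimal sketch that runs the same loop body on tiny vectors. The function name `toy_dependency` and the sizes and values are assumptions chosen purely for illustration; they are not part of the exercise.

```julia
# Hypothetical toy version of the loop above: 4 iterations on 6-element vectors.
function toy_dependency()
    n = 4                                # number of loop iterations
    a = zeros(Float32, n + 2)
    b = Float32[1, 2, 3, 4, 5, 6]
    c = fill(10f0, n + 2)
    d = fill(20f0, n + 2)
    b_before = copy(b)                   # snapshot of b before the loop runs

    for i in 1:n
        a[i] = a[i] + b[i]               # reads b[i] ...
        b[i+2] = c[i] + d[i]             # ... which iteration i-2 wrote (for i >= 3)
    end
    return a, b_before
end

a_toy, b_before = toy_dependency()

# Iterations 1 and 2 still see the original b[1], b[2];
# iterations 3 and 4 see the values 10 + 20 written two iterations earlier.
@assert a_toy[1:4] == Float32[1, 2, 30, 30]
@assert b_before[1:4] == Float32[1, 2, 3, 4]
```

If the iterations were executed in a different order (for example several at once in a SIMD lane), `a[3]` and `a[4]` could pick up the old values `3` and `4` instead, which is why the compiler has to be careful here.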

**Task 1**: Check the native code produced for `loop_naive!(a,b,c,d)` and convince yourself that the Julia compiler hasn't vectorized this code. (There shouldn't be any usage of `ymm` or `zmm` registers etc.)

In [2]:
@code_native loop_naive!(a,b,c,d)

	.text
	.file	"loop_naive!"
	.globl	"japi1_loop_naive!_769"         # -- Begin function japi1_loop_naive!_769
	.p2align	4, 0x90
	.type	"japi1_loop_naive!_769",@function
"japi1_loop_naive!_769":                # @"japi1_loop_naive!_769"
; ┌ @ In[1]:4 within `loop_naive!`
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	qword ptr [rbp - 56], rsi
	mov	rax, qword ptr [rsi]
	mov	rcx, qword ptr [rsi + 8]
	…

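Scanning long assembly listings by eye is error-prone, so the check can also be done programmatically. The following is a rough sketch, not a definitive test: the helper name `uses_wide_registers` and the demo kernel `add_demo!` are hypothetical, and the regular expression is an x86-specific heuristic that only looks for `ymm`/`zmm` register names.

```julia
using InteractiveUtils   # provides code_native (already loaded in the REPL and in notebooks)

# Hypothetical helper: true if the native code of `f` for the given argument
# types mentions a ymm or zmm register (heuristic, x86 only).
function uses_wide_registers(f, argtypes)
    asm = sprint(io -> code_native(io, f, argtypes; debuginfo=:none))
    return occursin(r"\b[yz]mm\d+", asm)
end

# Hypothetical demo kernel without any loop-carried dependency.
function add_demo!(y, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] += x[i]
    end
    return y
end

# The result depends on the CPU and the Julia/LLVM version.
uses_wide_registers(add_demo!, Tuple{Vector{Float32}, Vector{Float32}})
```

In the notebook, `uses_wide_registers(loop_naive!, NTuple{4, Vector{Float32}})` would apply the same heuristic to the function from Task 1.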

**Task 2**: Implement the same loop in `loop_naive_simd!` and try to force SIMD vectorization with the corresponding performance macro. (Keep the `@inbounds` as well.)

In [3]:
function loop_naive_simd!(a, b, c, d)
    @inbounds @simd for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
end

loop_naive_simd! (generic function with 1 method)

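**Task 3**: Check the native code of `loop_naive_simd!`. Has the code improved? The lesson here is that just putting `@simd` in front of a loop and hoping for the best isn't a particularly good strategy 😉 (Plain `@simd` does not assert that the loop is free of loop-carried memory dependencies, so the compiler still has to respect the dependency on `b`.)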

In [4]:
@code_native loop_naive_simd!(a,b,c,d)

	.text
	.file	"loop_naive_simd!"
	.globl	"japi1_loop_naive_simd!_878"    # -- Begin function japi1_loop_naive_simd!_878
	.p2align	4, 0x90
	.type	"japi1_loop_naive_simd!_878",@function
"japi1_loop_naive_simd!_878":           # @"japi1_loop_naive_simd!_878"
; ┌ @ In[3]:1 within `loop_naive_simd!`
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	mov	qword ptr [rbp - 56], rsi
	mov	rax, qword ptr [rsi]
	mov	rcx, qword ptr [rsi …

**Task 4**: Benchmark and compare the variants. What do you observe?


In [7]:
using BenchmarkTools
@btime loop_naive!($a,$b,$c,$d);
@btime loop_naive_simd!($a,$b,$c,$d);

  6.098 μs (0 allocations: 0 bytes)
  6.098 μs (0 allocations: 0 bytes)



**Task 5**: Take a closer look at the loop. Can you "resolve" the data-dependency issue by splitting up the loop into two separate loops? Implement this improved version in the functions below. Use `@simd` for the loops in the second function. (Again, keep `@inbounds` for all loops in both functions.)

In [19]:
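function loop_opt!(a, b, c, d)
    bcpy = copy(b) # snapshot of b: the reads below no longer depend on the writes to b
    @inbounds for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + bcpy[i] # uses the pre-loop b[i]; loop_naive! sees the updated b[i] for i >= 3
        b[i+2] = c[i] + d[i]
    end
end

function loop_opt_simd!(a, b, c, d)
    bcpy = deepcopy(b) # same contents as copy(b) for a Vector{Float32}
    @inbounds @simd for i in 1:LOOP_ITERATIONS
        a[i] = a[i] + bcpy[i] # uses the pre-loop b[i], as in loop_opt!
        b[i+2] = c[i] + d[i]
    end
end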

loop_opt_simd! (generic function with 1 method)
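The cell above breaks the dependency by reading from a copy of `b`, which costs an allocation and makes `a` see the pre-loop values of `b`. For comparison, here is one possible way to split the loop into two loops, as the task text suggests. It is a sketch under assumptions, not necessarily the intended solution: the names `ITERS`, `fused!`, and `split_simd!` are hypothetical, and the four arrays are assumed to be distinct (non-aliasing). Because every `b[i+2]` is written before any `a[i]` is updated, each `a[i]` reads the same value of `b[i]` as in the original fused loop.

```julia
const ITERS = 8192            # same value as LOOP_ITERATIONS in the notebook

# Reference: the original fused loop with the loop-carried dependency on b.
function fused!(a, b, c, d)
    @inbounds for i in 1:ITERS
        a[i] = a[i] + b[i]
        b[i+2] = c[i] + d[i]
    end
    return nothing
end

# Split version: first perform all writes to b, then update a.
# Neither loop reads an element that the same loop writes.
function split_simd!(a, b, c, d)
    @inbounds @simd for i in 1:ITERS
        b[i+2] = c[i] + d[i]
    end
    @inbounds @simd for i in 1:ITERS
        a[i] = a[i] + b[i]
    end
    return nothing
end

# Run both versions on identical inputs and compare the results.
a1, b1, c1, d1 = (rand(Float32, ITERS + 2) for _ in 1:4)
a2, b2 = copy(a1), copy(b1)
fused!(a1, b1, c1, d1)
split_simd!(a2, b2, c1, d1)
@assert a1 == a2 && b1 == b2
```

The comparison uses exact equality because each element is produced by a single addition of the same operands in both versions; there is no reduction that `@simd` could reorder.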

**Task 6**: Benchmark those new variants as well.
  * How do they compare to each other?
  * Did the SIMD performance macro help? (Hint: It shouldn't.)
  * How does the performance compare to the unoptimized variants above?

In [20]:
@btime loop_opt!($a,$b,$c,$d);
@btime loop_opt_simd!($a,$b,$c,$d);

  5.503 μs (2 allocations: 32.11 KiB)
  5.635 μs (4 allocations: 32.44 KiB)



**Task 7**: Check the native code of, e.g., `loop_opt_simd!`. Did it vectorize properly? (Look for `ymm` and `zmm` registers as well as a block of `vaddps` instructions. Note, though, that the exact registers and instructions are system-dependent.)

In [21]:
@code_native loop_opt_simd!(a,b,c,d)

	.text
	.file	"loop_opt_simd!"
	.globl	"japi1_loop_opt_simd!_1481"     # -- Begin function japi1_loop_opt_simd!_1481
	.p2align	4, 0x90
	.type	"japi1_loop_opt_simd!_1481",@function
"japi1_loop_opt_simd!_1481":            # @"japi1_loop_opt_simd!_1481"
; ┌ @ In[19]:9 within `loop_opt_simd!`
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 56
	vxorps	xmm0, xmm0, xmm0
	vmovaps	xmmword ptr [rbp - 80], xmm0
	mov	qword ptr [ …