# Exercise: Performance Optimization

Optimize the following function.

In [4]:
function work!(A, B, v)
    @assert size(A) == size(B)
    val = zero(eltype(v))
    for i in 1:N
        val = mod(v[i],256)
        A[i,1:N] = B[i,1:N] * (sin(val) * sin(val) - cos(val) * cos(val))
    end
    return A
end

work! (generic function with 1 method)

The following data is **fixed** and **not supposed to be modified**!

In [5]:
# do not modify this cell!

using Random
Random.seed!(42)

N = 8000
B = rand(N,N)
v = rand(Int, N);

const result = work!(zeros(N,N), B, v);

# do not modify this cell!

You can compare against `result` to test your implementation(s):

In [6]:
using Test

@test work!(zeros(N,N), B, v) ≈ result

[32m[1mTest Passed[22m[39m

You can benchmark as follows:

In [7]:
using BenchmarkTools

@btime work!(A, $B, $v) setup=(A=zeros(N,N)); # or use @benchmark for more information

  984.368 ms (134979 allocations: 979.23 MiB)


## Your Optimizations

Your optimized variants go here!

**Hints** (hopefully):
* What is suboptimal about the code? What is it that you'd want to change (but can't directly)?
* Sometimes writing the code in a different way doesn't give direct speedups but enables further optimization.
* A ~30x speedup should be possible on most systems 😉

In [8]:
# Your variants go here...

function work!(A, B, v)
    @assert size(A) == size(B)
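    # One scaling factor per row, allocated once instead of once per loop iteration.
    val = zeros(Float64, size(v))

    # sin(x)^2 - cos(x)^2 == -cos(2x), so a single cos call per row suffices.
    @. val = - cos(2.0 * mod(v, 256))
    # Fused broadcast: scales row i of B by val[i] without temporary arrays.
    @. A = B * val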

    return A
end

work! (generic function with 1 method)

In [9]:
@test work!(zeros(N,N), B, v) ≈ result

[32m[1mTest Passed[22m[39m

In [10]:
@btime work!(A, $B, $v) setup=(A=zeros(N,N)); # or use @benchmark for more information

  66.993 ms (2 allocations: 62.55 KiB)
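
The broadcast version above can also be written as explicit loops, which makes the memory access order visible. The following is only a sketch: the name `work_loops!` and the helper array `factors` are hypothetical and not part of the exercise, and whether it is faster than the broadcast version depends on the machine.

```julia
# Hypothetical loop-based variant (the name `work_loops!` is not from the notebook).
function work_loops!(A, B, v)
    @assert size(A) == size(B)
    @assert size(A, 1) == length(v)
    # Precompute one scaling factor per row: sin(x)^2 - cos(x)^2 == -cos(2x).
    factors = [-cos(2.0 * mod(x, 256)) for x in v]
    # Julia arrays are column-major, so the inner loop runs over rows (first index).
    @inbounds for j in axes(A, 2)
        for i in axes(A, 1)
            A[i, j] = B[i, j] * factors[i]
        end
    end
    return A
end
```

It can be checked and timed exactly like `work!`, for example with `@test work_loops!(zeros(N,N), B, v) ≈ result`.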


In [70]:
using Symbolics
@variables x
simplify(sin(x) * sin(x) - cos(x) * cos(x))


-cos(2x)
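
The symbolic result can be cross-checked numerically. The sketch below compares both expressions over `0:255`, which are the only values `mod(v[i], 256)` can produce; the variable names are chosen for illustration only.

```julia
# All values that mod(v[i], 256) can take for integer v[i].
xs = 0:255

# Original expression and its simplified form, evaluated element by element.
lhs = [sin(x) * sin(x) - cos(x) * cos(x) for x in xs]
rhs = [-cos(2x) for x in xs]

# Largest absolute deviation between the two forms (expected to be at rounding level).
max_deviation = maximum(abs.(lhs .- rhs))
println("Maximum deviation: ", max_deviation)
```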

## Bonus Question: Performance limit?

Look at your final optimized version of `work!`.

* In the limit of larger `A` and `B`, what is conceptually limiting the performance, the compute capability or memory transfer (i.e. reading and writing `A` and `B`)?

Let's quickly estimate the maximal memory bandwidth that a single CPU core can achieve on the given computer:

In [11]:
using STREAMBenchmark
membw = memory_bandwidth(; nthreads=1).median / 1000 # median memory bandwidth, converted from MB/s to GB/s

27.4223

For reference, a single CPU core of [Noctua 2](https://pc2.uni-paderborn.de/systems-and-services/noctua-2) can achieve a **maximal memory bandwidth of ~45 GB/s**.

* Given the maximal memory bandwidth, can you give a performance bound estimate, i.e. the minimal runtime that we could possibly hope to achieve?
  * Hint: how many flops are performed per iteration and how many bytes are transferred?
* How far off is your implementation from achieving the limit (in percent)?

In [12]:
# Your computation goes here...
flops = 1 # flops per iteration
traffic = 3*8 # bytes per iteration
I = flops / traffic # flops / byte

perf_bound = I*membw # GFLOPS
runtime_estimate = N^2 * 1e3 / (perf_bound * 1e9) # in ms

println("Performance bound: ", round(perf_bound, digits=2), " GFLOP/s")
println("Runtime estimate: ", round(runtime_estimate, digits=2), " ms")

Performance bound: 1.14 GFLOP/s
Runtime estimate: 56.01 ms
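
The numbers above follow from a simple bandwidth-limited model, written out here with the measured bandwidth $b \approx 27.42$ GB/s. The factor 3 in `traffic` is not explained in the notebook; presumably it counts one read of `B`, one write of `A`, and one additional read of `A` caused by write-allocate caching. Under that assumption:

```latex
I = \frac{1\,\text{flop}}{3 \cdot 8\,\text{B}} = \frac{1}{24}\,\frac{\text{flop}}{\text{B}},
\qquad
P_{\max} = I \cdot b \approx \frac{27.42\,\text{GB/s}}{24\,\text{B/flop}} \approx 1.14\,\text{GFLOP/s},
\qquad
t_{\min} = \frac{N^2}{P_{\max}} = \frac{6.4 \times 10^{7}\,\text{flop}}{1.14 \times 10^{9}\,\text{flop/s}} \approx 56\,\text{ms}.
```

Equivalently, $t_{\min}$ is the total traffic $24 N^2$ bytes divided by the bandwidth $b$.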


In [14]:
t_work5 = @belapsed work!(A, $B, $v) setup=(A=zeros(N,N))
ratio = runtime_estimate / (t_work5 * 1e3)
println("My best version achieves ", round(ratio * 100, digits=2), "% of the limit.")

My best version achieves 83.53% of the limit.
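
The same comparison can be expressed as an effective memory bandwidth. The sketch below assumes the same 3 × 8 bytes of traffic per matrix element as the estimate above; the helper `effective_bandwidth` is hypothetical, and the timing is copied from the `@btime` output of the optimized `work!`.

```julia
# Hypothetical helper (not part of the notebook): effective bandwidth in GB/s,
# assuming 3 * 8 bytes of traffic per matrix element of an n-by-n problem.
effective_bandwidth(n, t_seconds) = 3 * 8 * n^2 / t_seconds / 1e9

# Timing taken from the @btime output of the optimized work! (66.993 ms for N = 8000).
bw = effective_bandwidth(8000, 66.993e-3)
println("Effective bandwidth: ", round(bw, digits=2), " GB/s")
```

Dividing this value by `membw` gives the fraction of the measured bandwidth that the implementation actually uses.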
