# CUDA programming in Julia

See [GPU programming in Julia - What, Why and How | Tim Besard | Julia User Group Munich - YouTube](https://www.youtube.com/watch?v=0JiQoN_UD64)

## RMSE

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (A_i - B_i)^2}$

In [1]:
rmse(A, B) = sqrt(sum((A .- B) .^ 2) / length(A))

rmse (generic function with 1 method)

In [2]:
N = 2048
T = Float32

Float32

In [3]:
import Random
Random.seed!(1234)

Random.TaskLocalRNG()

In [4]:
A = rand(T, N, N);
B = rand(T, N, N);

In [5]:
using BenchmarkTools

In [6]:
@btime rmse($A, $B)

  5.005 ms (5 allocations: 16.00 MiB)


0.4082928f0
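
The 16.00 MiB reported by `@btime` is consistent with a single temporary array: `(A .- B) .^ 2` fuses into one broadcast, which materializes one N×N `Float32` array before `sum` reduces it. A quick check of that arithmetic, reusing the `N` and `T` defined above:

```julia
N = 2048
T = Float32

# Size of one N×N array of T, in bytes and in MiB (1 MiB = 2^20 bytes).
bytes = N * N * sizeof(T)
mib = bytes / 2^20

@show bytes  # 16777216
@show mib    # 16.0
```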

## Using CUDA (broadcasting)

In [7]:
using CUDA

In [8]:
CUDA.allowscalar(false)

In [9]:
dA = CuArray(A);
dB = CuArray(B);

In [10]:
@btime rmse($dA, $dB)

  397.398 μs (114 allocations: 6.25 KiB)


0.40829277f0

## Using CUDA (kernels)

In [11]:
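function rmse_cuda(A, B)
    # Validate inputs.
    @assert size(A) == size(B)
    # Allocate a one-element accumulator on the device and zero it.
    C = similar(A, 1)
    C .= 0
    # Launch grid: one thread per element, rounding the block count up.
    threads = 512
    blocks = cld(length(A), threads)
    @cuda threads = threads blocks = blocks rmse_kernel(C, A, B)
    # Fetch result. C[] is a scalar read from device memory, so scalar
    # indexing must be allowed when this function is called.
    return sqrt(C[] / length(A))
end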

rmse_cuda (generic function with 1 method)

In [14]:
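function rmse_kernel(C, A, B)
    # Global 1-based linear index of this thread across the whole grid.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x

    # The last block may extend past the end of the array.
    if i <= length(A)
        a = A[i]
        b = B[i]
        # Every thread adds into the same element, so the update must be atomic.
        CUDA.@atomic C[] += (a - b)^2
    end

    return
end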

rmse_kernel (generic function with 1 method)
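
The index arithmetic in the kernel can be checked on the CPU without a GPU. The sketch below is only an illustration: `global_index` and `covered_indices` are hypothetical helpers, not part of CUDA.jl, and the loops stand in for the threads that the hardware would run in parallel. It shows that rounding the block count up covers every element exactly once, and that the `i <= length(A)` guard is what rejects the surplus threads of the last block.

```julia
# CPU-only illustration of the launch geometry; these helpers are
# hypothetical and do not call into CUDA.jl.

# Same formula as in the kernel, with 1-based block and thread indices.
global_index(block, thread, threads_per_block) =
    (block - 1) * threads_per_block + thread

function covered_indices(n, threads_per_block)
    blocks = cld(n, threads_per_block)   # round up so the tail is covered
    hits = zeros(Int, n)                 # how many threads touch each element
    skipped = 0                          # threads rejected by the bounds check
    for block in 1:blocks, thread in 1:threads_per_block
        i = global_index(block, thread, threads_per_block)
        if i <= n
            hits[i] += 1
        else
            skipped += 1
        end
    end
    return blocks, hits, skipped
end

blocks, hits, skipped = covered_indices(1000, 512)
@show blocks            # 2
@show all(==(1), hits)  # true
@show skipped           # 24
```

With 1000 elements and 512 threads per block, two blocks launch 1024 threads, so 24 of them fall outside the array and do no work.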

In [16]:
# rmse_cuda reads its result with C[], which is scalar indexing on a CuArray.
CUDA.allowscalar(true)
@btime rmse_cuda($dA, $dB)

  6.844 ms (35 allocations: 1.98 KiB)


0.4074073f0
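
The kernel result (0.4074073) differs from the two broadcast results (0.40829...) in the third decimal place. One plausible explanation is accumulation order: the atomic adds fold all 2048² terms into a single `Float32` one at a time, and once the running total is much larger than each term, the low bits of every addend are rounded away. `sum`, by contrast, reduces in blocks. The following CPU sketch compares the two strategies under that assumption. It does not reproduce the GPU's actual ordering, and `sequential_sum` is a hypothetical helper.

```julia
import Random
Random.seed!(1234)

# Squared differences of uniform random Float32 values, as in the RMSE sum.
x = (rand(Float32, 2^22) .- rand(Float32, 2^22)) .^ 2

# Hypothetical helper: strict left-to-right accumulation in Float32,
# mimicking many atomic adds into one memory location (ordering aside).
function sequential_sum(v)
    acc = zero(eltype(v))
    for e in v
        acc += e
    end
    return acc
end

reference = sum(Float64.(x))   # higher-precision reference
pairwise  = sum(x)             # Base reduces arrays pairwise
naive     = sequential_sum(x)

@show abs(pairwise - reference) / reference
@show abs(naive - reference) / reference
```

If the sequential error is markedly larger, that supports reading the discrepancy as a rounding effect rather than a bug in the kernel.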

## Minimize global memory traffic