# Julia -- CUDA Native programming
## Author: Dr. Rahul Remanan

* A test notebook for the CUDAnative package in Julia.
* Based on the [Julia CUDAnative reduce example](https://github.com/JuliaGPU/CUDAnative.jl/tree/master/examples/reduce).
* CUDA is NVIDIA's platform for general-purpose computing on GPUs.
* With CUDAnative, high-performance GPU code can be written in Julia as CUDA kernels.
* CUDAnative sits at the same abstraction level as CUDA C.
* It is tightly integrated with the Julia compiler and the LLVM framework.


## Julia version check

In [1]:
versioninfo()

Julia Version 1.1.0-DEV.255
Commit 26b6a5811c (2018-09-13 21:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, broadwell)


## [Vector addition using GPU -- Using CUDAdrv.jl](https://julialang.org/blog/2017/03/cudanative)

* A small demo of the GPU programming capabilities in Julia.
* Uses functionality from CUDAdrv.jl.
* CUDAdrv.jl is a user-friendly wrapper for interacting with CUDA hardware.
* It provides an array type: CuArray.
* It performs memory management through integration with Julia's garbage collector.
* It provides @elapsed, which times code using GPU events.

### Import dependent libraries

In [None]:
using CUDAdrv, CUDAnative
using Test    # Base.Test was moved to the Test standard library in Julia 0.7/1.0

### Define a vector addition function using CUDA kernel

In [None]:
function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type,
    #                  and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]

    return nothing
end
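
The index arithmetic in `kernel_vadd` can be checked on the CPU. The sketch below is only an illustration: `global_index` is a hypothetical helper (not part of CUDAnative), and the block and grid sizes are arbitrary values. It shows that the 1-based block and thread indices map every thread to a unique 1-based array position.

```julia
# CPU-only sketch: emulate the 1-based global index computed inside kernel_vadd.
# `global_index` is a hypothetical helper, not a CUDAnative function.
global_index(block_idx, block_dim, thread_idx) =
    (block_idx - 1) * block_dim + thread_idx

block_dim = 4   # threads per block (illustrative value)
grid_dim  = 3   # blocks in the grid (illustrative value)

# Enumerate every (block, thread) pair in launch order.
indices = [global_index(b, block_dim, t) for b in 1:grid_dim for t in 1:block_dim]
```

With these values, `indices` covers `1:12` exactly once, so each element of `a`, `b` and `c` is handled by one thread.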

### Specify target GPU

In [None]:
dev = CuDevice(0)
ctx = CuContext(dev)

### Generate some data

In [None]:
len = 512
a = rand(Int, len)
b = rand(Int, len)

### Allocate resources and upload variables to the GPU

In [None]:
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)

### Run the code on the GPU and fetch the results

In [None]:
@cuda threads=len kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl; one block of `len` threads
c = Array(d_c)
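
The launch above uses a single block of `len` threads, which works because 512 threads fit in one block. For longer vectors the work has to be split over several blocks. The sketch below is one possible way to choose that split; `launch_config` and its `max_threads` default are assumptions made for illustration, not a CUDAnative API.

```julia
# Hypothetical helper: choose (blocks, threads) so that blocks * threads >= len.
function launch_config(len::Integer; max_threads::Integer = 512)
    len > 0 || throw(ArgumentError("len must be positive"))
    threads = min(len, max_threads)
    blocks = cld(len, threads)   # ceiling division
    return blocks, threads
end

small = launch_config(512)     # fits in a single block
large = launch_config(1000)    # needs a second, partially used block
```

When `blocks * threads` exceeds `len`, as in the second call, the surplus threads would index past the end of the arrays. `kernel_vadd` as written has no bounds check, so such a launch would also need a guard like `i <= length(c)` inside the kernel.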

### Compare results from the code executed in CUDA kernels with CPU execution

In [None]:
@test c == a + b

### Free up the GPU resources

In [None]:
destroy(ctx)

## [Parallel reduction using CUDA kernel -- Using CUDAnative.jl](https://julialang.org/blog/2017/03/cudanative)

* CUDAnative.jl takes care of everything related to native GPU programming:

    1) interfacing with Julia: repurpose the compiler to emit GPU-compatible LLVM IR (no calls to CPU libraries, simplified exceptions, …)
    
    2) interfacing with LLVM (using LLVM.jl): optimize the IR, and compile to PTX
    
    3) interfacing with CUDA (using CUDAdrv.jl): compile PTX to SASS, and upload it to the GPU
    
    
* These functionalities are hidden behind the call to @cuda.

* Intrinsics: special functions and macros that provide functionality hard or impossible to express using normal functions. For example, the {thread,block,grid}{Idx,Dim} functions provide access to the size and index of each level of work.

* Creating local shared memory: @cuStaticSharedMem and @cuDynamicSharedMem macros

* Display formatted string from within a kernel function: @cuprintf

* [Math functions similar to the Julia standard library](https://github.com/JuliaGPU/CUDAnative.jl/blob/0721783db9ac4cc2c2948cbf8cbff4aa5f7c4271/src/device/intrinsics.jl#L499-L807)

### Import dependent libraries

In [2]:
using Test
using CUDAdrv, CUDAnative

In [3]:
#
# Main implementation
#

# Reduce a value across a warp
@inline function reduce_warp(op::F, val::T)::T where {F<:Function,T}
    offset = CUDAnative.warpsize() ÷ 2
    # TODO: this can be unrolled if warpsize is known...
    while offset > 0
        val = op(val, shfl_down(val, offset))
        offset ÷= 2
    end
    return val
end

reduce_warp (generic function with 1 method)
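
To see why lane 1 ends up with the full result, the shuffle-down loop can be emulated on the CPU. This is a sketch under stated assumptions: `emulate_reduce_warp` is a hypothetical helper, the vector length stands in for `warpsize()` and is assumed to be a power of two, and an out-of-range shuffle is modelled as returning the lane's own value.

```julia
# CPU-only emulation of the shuffle-down reduction over one warp.
# `emulate_reduce_warp` is a hypothetical helper for illustration.
function emulate_reduce_warp(op, lanes::Vector)
    vals = copy(lanes)
    width = length(vals)     # stands in for warpsize(); assumed to be a power of two
    offset = width ÷ 2
    while offset > 0
        # Every lane reads the value held `offset` lanes above it;
        # lanes without such a neighbour read their own value back.
        shifted = [i + offset <= width ? vals[i + offset] : vals[i] for i in 1:width]
        vals = [op(vals[i], shifted[i]) for i in 1:width]
        offset ÷= 2
    end
    return vals[1]           # only lane 1 is guaranteed to hold the full reduction
end

warp_sum = emulate_reduce_warp(+, collect(1:32))
```

Each iteration halves the number of lanes that carry useful partial results, so a 32-lane warp needs five steps.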

In [4]:
# Reduce a value across a block, using shared memory for communication
@inline function reduce_block(op::F, val::T)::T where {F<:Function,T}
    # shared mem for 32 partial sums
    shared = @cuStaticSharedMem(T, 32)

    wid, lane = fldmod1(threadIdx().x, CUDAnative.warpsize())

    # each warp performs partial reduction
    val = reduce_warp(op, val)

    # write reduced value to shared memory
    if lane == 1
        @inbounds shared[wid] = val
    end

    # wait for all partial reductions
    sync_threads()

    # read from shared memory only if that warp existed
    @inbounds val = (threadIdx().x <= fld(blockDim().x, CUDAnative.warpsize())) ? shared[lane] : zero(T)

    # final reduce within first warp
    if wid == 1
        val = reduce_warp(op, val)
    end

    return val
end

reduce_block (generic function with 1 method)
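
The warp and lane bookkeeping in `reduce_block` relies on `fldmod1`, which returns 1-based quotient and remainder. The values below are a small CPU check, assuming a warp size of 32.

```julia
warp = 32   # assumed warp size

first_thread  = fldmod1(1, warp)    # (warp 1, lane 1)
last_in_warp  = fldmod1(32, warp)   # (warp 1, lane 32)
next_warp     = fldmod1(33, warp)   # (warp 2, lane 1)

# Number of warps in a 512-thread block; only this many slots of `shared`
# hold a partial result, which is what the read condition checks.
warps_in_block = fld(512, warp)
```

With a warp size of 32, a block of at most 1024 threads contains at most 32 warps, which matches the 32-element shared buffer.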

In [5]:
# Reduce an array across a complete grid
function reduce_grid(op::F, input::CuDeviceVector{T}, output::CuDeviceVector{T},
                     len::Integer) where {F<:Function,T}

    # TODO: neutral element depends on the operator (see Base's 2 and 3 argument `reduce`)
    val = zero(T)

    # reduce multiple elements per thread (grid-stride loop)
    # TODO: step range (see JuliaGPU/CUDAnative.jl#12)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    step = blockDim().x * gridDim().x
    while i <= len
        @inbounds val = op(val, input[i])
        i += step
    end

    val = reduce_block(op, val)

    if threadIdx().x == 1
        @inbounds output[blockIdx().x] = val
    end

    return
end

reduce_grid (generic function with 1 method)
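
The grid-stride loop in `reduce_grid` can be emulated sequentially to check that every element is visited exactly once. The sketch below is illustrative only: `emulate_grid_stride` is a hypothetical helper, and its `init` keyword is an assumption added to show the neutral-element issue mentioned in the TODO.

```julia
# CPU-only emulation of the per-thread accumulation in reduce_grid.
# `emulate_grid_stride` and its `init` keyword are hypothetical.
function emulate_grid_stride(op, input::Vector, nblocks::Integer, nthreads::Integer;
                             init = zero(eltype(input)))
    step = nblocks * nthreads          # total number of threads in the grid
    partials = fill(init, step)        # one accumulator per emulated thread
    for tid in 1:step
        i = tid
        while i <= length(input)
            partials[tid] = op(partials[tid], input[i])
            i += step
        end
    end
    return partials
end

sum_partials  = emulate_grid_stride(+, ones(Int32, 1000), 2, 4)
prod_partials = emulate_grid_stride(*, [1, 2, 3, 4, 5], 1, 2; init = 1)
```

Summing `sum_partials` gives 1000, as expected. For `*`, the default `zero(T)` would force every partial product to zero, which is why the second call passes `init = 1`.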

In [6]:
"""
Reduce a large array.
Kepler-specific implementation, i.e., you need sm_30 or higher to run this code.
"""
function gpu_reduce(op::Function, input::CuVector{T}, output::CuVector{T}) where {T}
    len = length(input)

    # TODO: these values are hardware-dependent, with recent GPUs supporting more threads
    threads = 512
    blocks = min((len + threads - 1) ÷ threads, 1024)

    # the output array must have a size equal to or larger than the number of thread blocks
    # in the grid because each block writes to a unique location within the array.
    if length(output) < blocks
        throw(ArgumentError("output array too small, should be at least $blocks elements"))
    end

    @cuda blocks=blocks threads=threads reduce_grid(op, input, output, len)
    @cuda threads=1024 reduce_grid(op, output, output, blocks)
end

gpu_reduce (generic function with 1 method)
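
The launch arithmetic in `gpu_reduce` can be worked through on the CPU for the array size used in this notebook. The variable names below are illustrative and chosen not to clash with the notebook's own.

```julia
n = 10^7          # number of input elements
threads = 512     # threads per block in the first pass

uncapped = cld(n, threads)                          # blocks needed for one element per thread
blocks = min((n + threads - 1) ÷ threads, 1024)     # same ceiling division, capped at 1024

# With the cap in place, the grid-stride loop makes each thread
# accumulate several elements before the block-level reduction.
per_thread = cld(n, blocks * threads)
```

Capping the grid at 1024 blocks leaves at most 1024 partial results, which the second launch reduces with a single block of 1024 threads.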

In [7]:
if capability(device()) < v"3.0"
    @warn("this example requires a newer GPU")
    exit(0)
end

In [8]:
len = 10^7
input = ones(Int32, len)

10000000-element Array{Int32,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

### CPU Reduce example

In [9]:
cpu_val = reduce(+, input)

10000000

### CUDAnative Reduce example

In [10]:
gpu_input = CuArray(input)
gpu_output = similar(gpu_input)
gpu_reduce(+, gpu_input, gpu_output)

In [11]:
Array(gpu_output)[1]

10000000

### Run test

In [12]:
let
    gpu_input = CuArray(input)
    gpu_output = similar(gpu_input)
    gpu_reduce(+, gpu_input, gpu_output)
    gpu_val = Array(gpu_output)[1]
    @assert cpu_val == gpu_val
    Test.@test cpu_val == gpu_val
end

[32m[1mTest Passed[22m[39m