# GPU Computing in Julia

This session introduces GPU computing in Julia.

## GPGPU

GPUs are ubiquitous in modern computers. The following are GPUs found in today's typical computer systems.

| NVIDIA GPUs         | Tesla K80                    | GTX 1080                         | GT 650M                       |
|---------------------|------------------------------|----------------------------------|-------------------------------|
|                     | ![Tesla K80](nvidia_k80.jpg) | ![GTX 1080](nvidia_gtx1080.jpg)  | ![GT 650M](nvidia_gt650m.jpg) |
| Computers           | servers, clusters            | desktop                          | laptop                        |
|                     | ![Server](gpu_server.jpg)    | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)  |
| Main usage          | scientific computing         | daily work, gaming               | daily work                    |
| Memory              | 24 GB                        | 8 GB                             | 1 GB                          |
| Memory bandwidth    | 480 GB/sec                   | 320 GB/sec                       | 80 GB/sec                     |
| Number of cores     | 4992                         | 2560                             | 384                           |
| Processor clock     | 562 MHz                      | 1.6 GHz                          | 0.9 GHz                       |
| Peak DP performance | 2.91 TFLOPS                  | 257 GFLOPS                       |                               |
| Peak SP performance | 8.73 TFLOPS                  | 8228 GFLOPS                      | 691 GFLOPS                    |
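
The peak single-precision figures in the table can be roughly reproduced as cores × clock × 2, assuming each core completes one fused multiply-add (two floating-point operations) per cycle. The sketch below uses the clock values listed in the table; the helper name `peak_sp_gflops` is made up for illustration, and the estimate only matches a vendor figure when the vendor quoted the peak at that same clock.

```julia
# Rough peak single-precision throughput in GFLOPS.
# Assumption: each core completes one fused multiply-add (2 FLOPs) per clock cycle.
peak_sp_gflops(cores, clock_ghz) = 2 * cores * clock_ghz

# (number of cores, clock in GHz), taken from the table above
gpus = [
    "Tesla K80" => (4992, 0.562),
    "GTX 1080"  => (2560, 1.6),
    "GT 650M"   => (384, 0.9),
]

estimates = Dict(name => peak_sp_gflops(spec...) for (name, spec) in gpus)

for (name, _) in gpus
    println(rpad(name, 10), " ≈ ", round(estimates[name], digits = 1), " GFLOPS")
end
```

For the GT 650M and GTX 1080 the estimate lands close to the tabulated values. For the Tesla K80 the estimate at 562 MHz comes out well below 8.73 TFLOPS, which suggests the tabulated peak was quoted at a higher boost clock.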

GPU architecture vs. CPU architecture:
* GPUs contain hundreds to thousands of processing cores on a single card; several cards can fit in a desktop PC
* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm
* Extremely high arithmetic throughput *if* one can transfer the data onto, and the results off of, the processors quickly

| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |
|----------------------------------|------------------------------------|
| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |

## GPGPU in Julia

GPU support in Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages.

There are at least three paradigms for programming GPUs in Julia.

- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, ...

  The [CuArrays.jl](https://github.com/JuliaGPU/CuArrays.jl) package allows defining arrays on Nvidia GPUs and overloads many common operations. CuArrays.jl supports Julia v1.0+.

- **OpenCL** is a standard supported by multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but it lacks some libraries essential for statistical computing.

  The [CLArrays.jl](https://github.com/JuliaGPU/CLArrays.jl) package allows defining arrays on OpenCL devices and overloads many common operations.

- [**ArrayFire**](https://arrayfire.com) is a high-performance library that works on both the CUDA and OpenCL frameworks.

  The [ArrayFire.jl](https://github.com/JuliaGPU/ArrayFire.jl) package wraps the library for Julia.

- **Warning:** The most recent Apple operating system, macOS 10.15 (Catalina), does **not** support CUDA yet.

I'll illustrate using CuArrays on my Linux box running CentOS 7. It has an NVIDIA GeForce RTX 2080 Ti OC with 11 GB of GDDR6 memory (14 Gbps) and 4352 cores.

In [1]:
versioninfo()

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)


## Query GPU devices in the system

In [2]:
using CuArrays, CUDAdrv

# check available devices on this machine and show their capability
for device in CuArrays.devices()
    @show capability(device)
end

capability(device) = v"7.5.0"


## Transfer data between main memory and GPU

In [3]:
# generate data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = CuArray(x)

3×3 CuArray{Float32,2,Nothing}:
 0.181051  0.114813  0.10958
 0.360586  0.211851  0.66229
 0.773347  0.581369  0.874378

In [4]:
# generate array on GPU directly
yd = ones(CuArray{Float32}, 3, 3)

3×3 CuArray{Float32,2,Nothing}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

In [5]:
# collect data from GPU to CPU
collect(xd)

3×3 Array{Float32,2}:
 0.181051  0.114813  0.10958
 0.360586  0.211851  0.66229
 0.773347  0.581369  0.874378

## Elementwise operations on GPU

In [6]:
zd = log.(yd .+ sin.(xd))

3×3 CuArray{Float32,2,Nothing}:
 0.165569  0.108461  0.103784
 0.302193  0.190844  0.479288
 0.529766  0.437719  0.569365

In [7]:
# getting back x
asin.(exp.(zd) .- yd)

3×3 CuArray{Float32,2,Nothing}:
 0.181051  0.114813  0.10958
 0.360586  0.211851  0.66229
 0.773347  0.581369  0.874377
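
Because the dot syntax is ordinary Julia broadcasting, the same expressions can be tried on a plain CPU `Array` before moving data to the GPU. The sketch below repeats the two cells above on the CPU. The function names `forward` and `inverse` are made up for illustration, and it assumes the inputs lie in [0, 1) so that `asin` inverts `sin`.

```julia
# Elementwise transform and its inverse, written with dot-broadcasting only.
# (Hypothetical helper names; the bodies mirror the GPU cells above.)
forward(x, y) = log.(y .+ sin.(x))
inverse(z, y) = asin.(exp.(z) .- y)

x = rand(Float32, 3, 3)   # entries in [0, 1), where asin(sin(x)) equals x up to rounding
y = ones(Float32, 3, 3)

z = forward(x, y)
x_back = inverse(z, y)

# Maximum absolute round-trip error; expected to be on the order of Float32 rounding
maxerr = maximum(abs.(x_back .- x))
println("max round-trip error = ", maxerr)
```

Passing `CuArray`s instead of `Array`s to the same two functions should run the computation on the GPU, since the cells above show these broadcasts working on `CuArray`.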

## Linear algebra

In [8]:
using LinearAlgebra

zd = zeros(CuArray{Float32}, 3, 3)
mul!(zd, xd, yd)

3×3 CuArray{Float32,2,Nothing}:
 0.405445  0.405445  0.405445
 1.23473   1.23473   1.23473
 2.22909   2.22909   2.22909
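
`mul!(C, A, B)` overwrites `C` with the product `A * B` instead of allocating a new result. Since `yd` is all ones, every column of the product equals the row sums of `xd`, which is why the three columns in the output above are identical. A CPU-only sketch of the same check, using the values printed earlier for `x`:

```julia
using LinearAlgebra

# Values copied from the 3×3 matrix printed earlier
x = Float32[0.181051 0.114813 0.10958;
            0.360586 0.211851 0.66229;
            0.773347 0.581369 0.874378]
y = ones(Float32, 3, 3)
z = zeros(Float32, 3, 3)

mul!(z, x, y)                # in-place product: z now holds x * y

rowsums = sum(x, dims = 2)   # 3×1 matrix of row sums
# Check that each column of z matches the row sums of x
println(all(z[:, j] ≈ vec(rowsums) for j in 1:3))
```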

In [9]:
using BenchmarkTools

n = 1024
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = CuArray(x)
yd = CuArray(y)
zd = CuArray(z)

# SP matrix multiplication on GPU
# (GPU calls are asynchronous; wrap the call in CuArrays.@sync to time the full computation)
@benchmark mul!($zd, $xd, $yd)

BenchmarkTools.Trial: 
  memory estimate:  224 bytes
  allocs estimate:  4
  --------------
  minimum time:     4.453 μs (0.00% GC)
  median time:      175.771 μs (0.00% GC)
  mean time:        169.631 μs (0.00% GC)
  maximum time:     1.309 ms (0.00% GC)
  --------------
  samples:          4202
  evals/sample:     7

In [10]:
# SP matrix multiplication on CPU
@benchmark mul!($z, $x, $y)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.752 ms (0.00% GC)
  median time:      7.810 ms (0.00% GC)
  mean time:        7.943 ms (0.00% GC)
  maximum time:     19.237 ms (0.00% GC)
  --------------
  samples:          629
  evals/sample:     1
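
To put the two benchmarks on a common scale, the median times can be converted to throughput. The sketch below assumes the usual count of 2n³ floating-point operations for an n × n matrix product, and that the reported GPU time covers the full computation rather than just the kernel launch. The variable names are illustrative.

```julia
n = 1024
flops = 2 * n^3            # assumed operation count: n^3 multiplications plus n^3 additions

# Median times reported by the two benchmarks above, in seconds
t_gpu = 175.771e-6
t_cpu = 7.810e-3

gflops_gpu = flops / t_gpu / 1e9
gflops_cpu = flops / t_cpu / 1e9
speedup = t_cpu / t_gpu

println("GPU: ", round(gflops_gpu, digits = 0), " GFLOPS")
println("CPU: ", round(gflops_cpu, digits = 0), " GFLOPS")
println("speedup ≈ ", round(speedup, digits = 1), "x")
```

With these numbers the GPU comes out at roughly 12 TFLOPS against roughly 275 GFLOPS on the CPU, a speedup of about 44x in median time.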