This lecture introduces GPU computing in Julia.

## GPGPU

GPUs are ubiquitous in modern computers. Following are NVIDIA GPUs on today's typical computer systems.

| NVIDIA GPUs         | H100 PCIe                           | RTX 6000                                 | RTX 5000                              |
|---------------------|----------------------------------------|-----------------------------------------|--------------------------------------|
|                     | ![H100](nvidia_h100.png) | ![RTX 6000](nvidia_rtx6000.png)    | ![RTX 5000](nvidia_rtx5000.png) |
| Computers           | servers, cluster                       | desktop                                 | laptop                               |
|                     | ![Server](gpu_server.jpg)       | ![Desktop](alienware-area51.png) | ![Laptop](macpro_inside.png)  |
| Main usage          | scientific computing                   | daily work, gaming                      | daily work                           |
| Memory              | 80 GB                                    | 48 GB                                   | 16 GB                                  |
| Memory bandwidth    | 2 TB/sec                              | 960 GB/sec                               | 576 GB/sec                             |
| Number of cores     | ???                                    | ???                                     | ???                                  |
| Processor clock     | ??? GHz                                 | ??? GHz                                  | ??? GHz                               |
| Peak DP performance | 26 TFLOPS                              | ??? TFLOPS                                        |                                    ??? TFLOPS  |
| Peak SP performance | 51 TFLOPS                            | 91.1 TFLOPS                              | 42.6 TFLOPS                            |

## GPU architecture vs CPU architecture

* GPUs contain 1000s of processing cores on a single card; several cards can fit in a desktop PC  

* Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm  

* Extremely high arithmetic intensity *if* one can transfer the data onto and results off of the processors quickly

| ![i7 die](cpu_i7_die.png) | ![Fermi die](Fermi_Die.png) |
|----------------------------------|------------------------------------|
| ![Einstein](einstein.png) | ![Rain man](rainman.png)    |

## GPGPU in Julia

GPU support by Julia is under active development. Check [JuliaGPU](https://github.com/JuliaGPU) for currently available packages. 

There are multiple paradigms to program GPU in Julia, depending on the specific hardware.

- **CUDA** is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolve, CuDNN, ...

  The [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) package allows defining arrays on **Nvidia GPUs** and overloads many common operations.

- The [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) package allows defining arrays on **AMD GPUs** and overloads many common operations.

- The [Metal.jl](https://github.com/JuliaGPU/Metal.jl) package allows defining arrays on **Apple Silicon** GPU and overloads many common operations.  

    [AppleAccelerate.jl](https://github.com/JuliaLinearAlgebra/AppleAccelerate.jl) wraps the [macOS Accelerate framework](https://developer.apple.com/documentation/accelerate), which provides high-performance libraries for linear algebra, signal processing, and image processing on Apple Silicon CPU. This is analog of MKL for Intel CPU.

- The [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl) package allows defining arrays on **Intel GPUs** and overloads many common operations.

I'll illustrate using Metal.jl on my MacBook Pro running MacOS Sequoia 15.4. It has Apple M2 chip with 38 GPU cores.

In [32]:
versioninfo()

Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 13th Gen Intel(R) Core(TM) i7-13800H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)


Load packages:

In [33]:
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()

[32m[1m  Activating[22m[39m project at `~/2025spring/slides/09-juliagpu`


[32m[1mStatus[22m[39m `~/2025spring/slides/09-juliagpu/Project.toml`
  [90m[6e4b80f9] [39mBenchmarkTools v1.6.0
  [90m[052768ef] [39mCUDA v5.7.3
  [90m[bdcacae8] [39mLoopVectorization v0.12.172
  [90m[37e2e46d] [39mLinearAlgebra v1.11.0


## Query GPU devices in the system

In [34]:
using CUDA

CUDA.versioninfo()

CUDA runtime 12.8, artifact installation
CUDA driver 12.4
NVIDIA driver 553.5.0

CUDA libraries: 
- CUBLAS: 12.8.4
- CURAND: 10.3.9
- CUFFT: 11.3.3
- CUSOLVER: 11.7.3
- CUSPARSE: 12.5.8
- CUPTI: 2025.1.1 (API 26.0.0)
- NVML: 12.0.0+550.117

Julia packages: 
- CUDA: 5.7.3
- CUDA_Driver_jll: 0.12.1+1
- CUDA_Runtime_jll: 0.16.1+0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

1 device:
  0: NVIDIA RTX 2000 Ada Generation Laptop GPU (sm_89, 173.738 MiB / 7.996 GiB available)


## Transfer data between main memory and GPU

In [35]:
using Random
Random.seed!(257)

# generate SP data on CPU
x = rand(Float32, 3, 3)
# transfer data form CPU to GPU
xd = CuArray(x)

3×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135

In [36]:
# generate array on GPU directly
# yd = Metal.ones(3, 3)
yd = CuArray(ones(Float32, 3, 3))

3×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

In [37]:
# collect data from GPU to CPU
x = collect(xd)

3×3 Matrix{Float32}:
 0.145793  0.939801  0.479926
 0.567772  0.577251  0.81655
 0.800538  0.38893   0.914135

## Linear algebra

In [38]:
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)

n = 2^14
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = CuArray(x)
yd = CuArray(y)
zd = CuArray(z);

### Dot product

In [39]:
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)

BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m120.225 ms[22m[39m … [35m297.097 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m216.021 ms               [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m208.404 ms[22m[39m ± [32m 45.054 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [32m [39m[39m [39m▃[34m▃[39m[39m [39m█[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m 
  [39m▇[39m▁

In [40]:
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark CUDA.@sync dot($xd, $yd)

BenchmarkTools.Trial: 471 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m10.163 ms[22m[39m … [35m 13.445 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m10.518 ms               [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m10.612 ms[22m[39m ± [32m422.558 μs[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [39m [39m [39m [39m▁[39m▆[39m█[39m▆[34m█[39m[39m▄[32m▄[39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m 
  [39m▃[39m▃[39m▅[3

In [41]:
# speedup on GPU over CPU
median(bm_cpu.times) / median(bm_gpu.times)

20.538494061907674

### Broadcast

In [42]:
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y

BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m149.392 ms[22m[39m … [35m285.251 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m201.841 ms               [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m211.201 ms[22m[39m ± [32m 52.516 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [39m█[39m [39m▁[39m█[39m▁[39m▁[39m▁[39m▁[39m▁[39m [39m [39m [39m▁[39m [39m [39m▁[34m [39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [32m [39m[39m [39m [39m▁[39m [39m [39m [39m [39m▁[39m [39m▁[39m [39m [39m [39m [39m▁[39m [39m [39m [39m█[39m [39m [39m [39m [39m [39m [39m [39m [39m█[39m▁[39m▁[39m [39m█[39m [39m 
  [39m█[39m▁

In [43]:
# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark CUDA.@sync $zd .= $xd .* $yd

BenchmarkTools.Trial: 266 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m16.378 ms[22m[39m … [35m25.025 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m18.272 ms              [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m18.769 ms[22m[39m ± [32m 1.966 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [39m [39m▄[39m▅[39m▁[39m█[39m▁[39m [39m▃[39m▄[39m▂[39m [39m [39m [39m▁[39m [39m▁[34m▆[39m[39m [39m [39m [32m [39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m 
  [39m▇[39m█[39m█[39m█[39m█

In [44]:
# speedup
median(bm_cpu.times) / median(bm_gpu.times)

11.04645564257901

### Matrix multiplication

In [45]:
# SP matrix multiplication on GPU
bm_gpu = @benchmark CUDA.@sync mul!($zd, $xd, $yd)

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m1.760 s[22m[39m … [35m   2.216 s[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m1.880 s               [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m1.952 s[22m[39m ± [32m236.160 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [34m█[39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m█[39m [39m [39m [39m [39m [39m [39m [39m [32m [39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m█[39m [39m 
  [34m█[39m[39m▁[39m▁[39m▁[39m▁[39m▁

For this problem size on this machine, we see GPU achieves a staggering **9 TFLOPS** throughput with single precision!

In [46]:
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)

4.997960387919036e12

In [47]:
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took [34m25.606 s[39m (0.00% GC) to evaluate,
 with a memory estimate of [33m0 bytes[39m, over [33m0[39m allocations.

In [48]:
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)

3.4351371519838055e11

We see >10x speedup by GPUs in this matrix multiplication example.

In [49]:
median(bm_cpu.times) / median(bm_gpu.times)

13.619347867332454

### Cholesky

In [50]:
# cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
xtxd = xd'xd + I
bm_gpu = @benchmark CUDA.@sync cholesky($(xtxd))
bm_gpu

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took [34m15.237 s[39m (0.02% GC) to evaluate,
 with a memory estimate of [33m7.03 KiB[39m, over [33m287[39m allocations.

In [51]:
xtx = collect(xtxd)
bm_cpu = @benchmark LinearAlgebra.cholesky($(Symmetric(xtx)))
bm_cpu

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took [34m7.536 s[39m (0.00% GC) to evaluate,
 with a memory estimate of [33m1.00 GiB[39m, over [33m3[39m allocations.

We about 12x speedup of Cholesky by this NVIDIA GPU.

In [52]:
median(bm_cpu.times) / median(bm_gpu.times)

0.4945712003645284

## Evaluation of elementary and special functions on GPU

### Sine and log functions

In [53]:
# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark CUDA.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu

BenchmarkTools.Trial: 206 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m20.304 ms[22m[39m … [35m30.321 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m23.748 ms              [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m24.227 ms[22m[39m ± [32m 2.278 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m▁[39m▄[39m▅[39m█[39m [39m▁[39m [39m [34m [39m[39m [39m [39m [32m [39m[39m [39m [39m [39m [39m▂[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m 
  [39m▃[39m▂[39m▄[39m▄[39m▃

In [54]:
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu

BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took [34m6.248 s[39m (0.00% GC) to evaluate,
 with a memory estimate of [33m0 bytes[39m, over [33m0[39m allocations.

In [55]:
# Speed up
median(bm_cpu.times) / median(bm_gpu.times)

263.10511991036014

GPU brings great speedup (>50x) to the massive evaluation of elementary math functions.

### tanh function

In [56]:
bm_cpu = @benchmark z .= tanh.($x) # on CPU
bm_cpu

BenchmarkTools.Trial: 3 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m2.227 s[22m[39m … [35m  2.333 s[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m0.00% … 0.00%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m2.245 s              [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m0.00%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m2.268 s[22m[39m ± [32m56.895 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m0.00% ± 0.00%

  [34m█[39m[39m [39m [39m [39m [39m [39m [39m [39m [39m█[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [32m [39m[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m█[39m [39m 
  [34m█[39m[39m▁[39m▁[39m▁[39m▁[39m▁[39m▁[3

In [57]:
bm_gpu = @benchmark zd .= CUDA.@sync tanh.($xd) # GPU
bm_gpu

BenchmarkTools.Trial: 154 samples with 1 evaluation per sample.
 Range [90m([39m[36m[1mmin[22m[39m … [35mmax[39m[90m):  [39m[36m[1m22.425 ms[22m[39m … [35m97.954 ms[39m  [90m┊[39m GC [90m([39mmin … max[90m): [39m 8.56% … 1.68%
 Time  [90m([39m[34m[1mmedian[22m[39m[90m):     [39m[34m[1m30.644 ms              [22m[39m[90m┊[39m GC [90m([39mmedian[90m):    [39m11.84%
 Time  [90m([39m[32m[1mmean[22m[39m ± [32mσ[39m[90m):   [39m[32m[1m32.455 ms[22m[39m ± [32m 8.055 ms[39m  [90m┊[39m GC [90m([39mmean ± σ[90m):  [39m10.90% ± 4.63%

  [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m▄[39m [39m▂[39m▂[39m [39m [39m [39m▂[39m█[39m [39m▂[34m▂[39m[39m [39m [39m▂[39m [39m [32m [39m[39m [39m [39m [39m▃[39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m [39m 
  [39m▃[39m▁[39m▁[39m▁[3

Metal.jl accelerates the evaluation of tanh function by

In [58]:
median(bm_cpu.times) / median(bm_gpu.times)

73.2540472089944