# 28) Intro to GPU programming in Julia 

Last time:
- Practical CUDA
- Memory
- Tuckoo demo for CUDA codes

Today:
1. Intro to GPU programming in Julia (CUDA.jl)

## 1. Intro to GPU programming in Julia (CUDA.jl)

For GPU programming there are many great resources. Some that I may refer to are:

:::{note} References
- [Warburton | YouTube video](https://www.youtube.com/watch?v=uvVy3CqpVbM) (In the first 47 minutes of the video, Tim gives an excellent introduction to the GPU.)
- [Warburton | ATPESC pdf](https://extremecomputingtraining.anl.gov/files/2018/08/ATPESC_2018_Track-2_3_8-2_830am_Warburton-Accelerators.pdf)
:::

### Add vectors

The real "hello world" of the GPU is adding vectors:

$$
C = A + B
$$

Big ideas:

- threads
- blocks
- how to launch and time kernels
- off-loaded memory on the device

### Example with [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)

To run this and the following Julia CUDA examples, you first need to add [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) to your environment with

```julia
using Pkg
Pkg.add("CUDA")
```

Then you can execute the following script:


```{literalinclude} ../julia_codes/module8-1/add_cu_arrays.jl
:language: julia
:linenos: true
```
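The "big ideas" listed above (threads, blocks, launching and timing a kernel) can also be made explicit with a hand-written kernel. The following is a minimal sketch and not the included script: the names `vadd_kernel!` and `vadd_gpu` and the choice of 256 threads per block are assumptions made for illustration. It is guarded by `CUDA.functional()` so that it does nothing on a machine without a usable GPU.

```julia
using CUDA

# Hypothetical kernel: each thread adds one pair of elements.
function vadd_kernel!(c, a, b)
    # Global 1-based index of this thread across all blocks
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard: the last block may contain threads past the end of the array
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

# Hypothetical host-side wrapper: upload, launch, time, download.
function vadd_gpu(a::Vector{Float32}, b::Vector{Float32}; threads = 256)
    d_a = CuArray(a)                  # upload inputs to the device
    d_b = CuArray(b)
    d_c = similar(d_a)                # uninitialized device output
    blocks = cld(length(a), threads)  # enough blocks to cover every element

    # Warm-up launch so the timed launch does not include compilation
    @cuda threads=threads blocks=blocks vadd_kernel!(d_c, d_a, d_b)
    # Time a second launch with CUDA events
    t = CUDA.@elapsed @cuda threads=threads blocks=blocks vadd_kernel!(d_c, d_a, d_b)

    return Array(d_c), t              # download the result to the CPU
end

if CUDA.functional()
    N = 2^20
    a = rand(Float32, N)
    b = rand(Float32, N)
    c, t = vadd_gpu(a, b)
    println("kernel time: ", t, " s; matches CPU result: ", c ≈ a .+ b)
end
```

The bounds check inside the kernel matters whenever the vector length is not a multiple of the block size, since the grid is rounded up with `cld`.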

### Memory management

As we have seen so far, a crucial aspect of working with a GPU is managing the data on it. 

The `CuArray` type is the primary interface for doing so: creating a `CuArray` allocates memory on the GPU, copying elements into it uploads data to the device, and converting it back to an `Array` downloads the values to the CPU. Let's see it in an example:

```{literalinclude} ../julia_codes/module8-1/copy_cu_array.jl
:language: julia
:linenos: true
```

**Observation on garbage collection**:

- One striking difference between the native C CUDA implementation and the Julia CUDA.jl interface is that instances of the `CuArray` type are managed by the Julia garbage collector. This means that they are collected once they become unreachable, and the memory held by them is then repurposed or freed. There is _no need_ for manual memory management (such as calling `cudaFree`); just make sure your objects are no longer reachable (i.e., no references to them remain).
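As a rough illustration of this observation, the sketch below releases device memory in two ways: by dropping the last reference and letting the garbage collector do the work, and by freeing a buffer eagerly with `CUDA.unsafe_free!`. The variable names and array size are assumptions for illustration, and the explicit `GC.gc()` call is only there to make the effect visible; ordinary code would rarely need it.

```julia
using CUDA

if CUDA.functional()
    d_x = CUDA.rand(Float32, 2^20)   # allocated and filled on the device
    d_y = 2f0 .* d_x                 # broadcast runs on the GPU, result stays there
    h_y = Array(d_y)                 # download a copy to the CPU

    # Option 1: drop the only reference and let the garbage collector reclaim it
    d_x = nothing
    GC.gc()

    # Option 2: free eagerly; d_y must not be used after this call
    CUDA.unsafe_free!(d_y)

    CUDA.reclaim()                   # return cached pool memory to the driver
    CUDA.memory_status()             # print current device memory usage
end
```

Eager freeing is mainly useful when large temporaries would otherwise sit in memory until the next collection; the host copy `h_y` is unaffected by either option.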

### Reverse vectors

The "hello world 2.0" of the GPU is reversing a vector in place:

$$
A_i \leftrightarrow A_{N - i + 1}, \quad \textrm{with } i = 1, \ldots, \lfloor N/2 \rfloor
$$

Big ideas:

- thread independence
- [race conditions](https://en.wikipedia.org/wiki/Race_condition)
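One way to make the threads independent is to let each thread own exactly one pair of elements, so that no two threads ever touch the same entry. The sketch below follows that idea; the names `reverse_kernel!` and `reverse_gpu!` and the block size are assumptions for illustration. Only the first half of the indices does any work: if all $N$ threads performed a swap, each pair would be handled by two threads at once, which is precisely the kind of race condition to avoid.

```julia
using CUDA

# Hypothetical kernel: thread i swaps a[i] with its mirror a[n - i + 1].
function reverse_kernel!(a)
    n = length(a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Only the first half of the indices swap; for odd n the middle stays put
    if i <= n ÷ 2
        j = n - i + 1
        @inbounds begin
            tmp = a[i]
            a[i] = a[j]
            a[j] = tmp
        end
    end
    return nothing
end

# Hypothetical host-side wrapper: one thread per pair of elements.
function reverse_gpu!(d_a::CuArray; threads = 256)
    half = length(d_a) ÷ 2
    half == 0 && return d_a          # nothing to swap for length 0 or 1
    blocks = cld(half, threads)
    @cuda threads=threads blocks=blocks reverse_kernel!(d_a)
    return d_a
end

if CUDA.functional()
    h_a = Float32.(1:11)             # odd length exercises the middle element
    d_a = CuArray(h_a)
    reverse_gpu!(d_a)
    println("matches CPU reverse: ", Array(d_a) == reverse(h_a))
end
```

Because every element belongs to exactly one pair and every pair to exactly one thread, the result does not depend on the order in which threads run.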