# GPU programming with PyOpenCL

Layout

1. Modern CPUs are not that different from GPUs
2. Introduction to OpenCL 
3. Comparison with Julia, Numba, Numba-CUDA, PyCUDA, CuPy
4. First kernel 
5. Parallel programming design patterns
6. Metaprogramming in PyOpenCL
7. Conclusions

![CPU vs GPU](cpu_gpu.png)

## Introduction to OpenCL
* Vendor & platform neutral parallel programming language to address CPU, GPU, FPGA and other types of accelerators. 
* Initiated by Apple in 2008 and managed by the Khronos Group, currently at version 3.0.

* There is a strict separation between host code and device code (called kernels).
* Kernels are written in a subset of C99 (no function pointers, no recursion) and compiled at _runtime_.

* The host API is defined in C, and bindings exist for many programming languages, including Python.

* SYCL (a Khronos standard, strongly backed by Intel) is the descendant of OpenCL, where C++ host code can be mixed with kernels in a single source file (à la CUDA).

* There is a nearly one-to-one mapping between the concepts used in CUDA kernels and those of OpenCL (version 1.2).
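The correspondence between the two vocabularies can be summarised as a small lookup table. The sketch below uses a plain Python dictionary (the name `CUDA_TO_OPENCL` is purely illustrative) and covers only the most common terms, not the whole API.

```python
# Illustrative lookup table: CUDA term -> OpenCL equivalent.
CUDA_TO_OPENCL = {
    "thread": "work-item",
    "block": "work-group",
    "grid": "NDRange",
    "__global__ function": "__kernel function",
    "__shared__ memory": "__local memory",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    # Equivalent as long as no global offset is used in OpenCL:
    "blockIdx.x * blockDim.x + threadIdx.x": "get_global_id(0)",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
}

for cuda_term, opencl_term in CUDA_TO_OPENCL.items():
    print(f"{cuda_term:40s} -> {opencl_term}")
```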

## Definition:

In many-core device programming, a `kernel` is the core of the computational code, i.e. the body of the loop without the outer loop itself; it runs in parallel on a device.

In [1]:
import numpy
size = 1000
a = numpy.random.random(size)
b = numpy.random.random(size)
res = numpy.empty_like(a)

for idx in range(size):
    res[idx] = a[idx] + b[idx]

The kernel is `res[idx] = a[idx] + b[idx]`.

The size of the problem, `size`, is specified at runtime, usually in host code.

The position of the working element, `idx`, is obtained via specific API calls (`get_global_id` in OpenCL).
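The separation between kernel and outer loop can be emulated in plain Python. This is only a sketch: `launch` is a hypothetical stand-in for the runtime, which on a real device executes the calls in parallel and in no guaranteed order.

```python
def add_kernel(idx, a, b, res):
    """Body of the loop: what a single work-item computes."""
    res[idx] = a[idx] + b[idx]


def launch(kernel, global_size, *args):
    """Hypothetical stand-in for the runtime: call the kernel once per index."""
    for idx in range(global_size):
        kernel(idx, *args)


size = 8  # problem size, chosen by the host at runtime
a = [float(i) for i in range(size)]
b = [10.0 * i for i in range(size)]
res = [0.0] * size

launch(add_kernel, size, a, b, res)
print(res)
```

Because every call only touches its own `idx`, the order of execution does not matter, which is what makes this kernel trivially parallel.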

## Comparison with other programming languages ...


### Julia

* Julia's strength resides in the just-in-time compilation with strong dynamic typing.
* Julia has a strong support for GPU programming.
* OpenCL is compiled at runtime, which is similar to just-in-time compilation, but with static types (and without templates).

### Numba

* Numba's strength resides in the just-in-time compilation with strong dynamic typing ... in Python.
* Not all Python is supported
* Heavy LLVM dependency ... but not worse than the OpenCL driver


### Numba-cuda

* Same as `numba` but targeting Nvidia GPUs
* Limited to element-wise, stencil and reduction kernels
* It is strange to write kernels in Python ...
*   ... Python becomes statically typed !

### Cupy

* CUDA interface with `numpy`-like functions (same signatures)
* automatic management of memory...
* ... hence huge overhead due to memory transfers.

### PyCuda
* Same author and philosophy as PyOpenCL: Expose everything, then add some metaprogramming sugar
* Explicit kernel in C(++) and explicit memory management
* Runtime compilation (unlike plain CUDA, which is compiled ahead of time)
* Interfaces to advanced libraries like `cublas`, `cufft`, ... via `scikit-cuda`

## PyOpenCL

* Expose the full OpenCL API
* Explicit kernels in C99 
* Explicit memory management
* Explicit device and context management
* Asynchronous execution with queues and events
* Runtime compilation (since the driver contains the compiler)
* Many metaprogramming features to expose advanced parallel algorithms
* Integration in Jupyter notebooks

In [1]:
%load_ext pyopencl.ipython_ext

In [2]:
import numpy as np
import pyopencl as cl
import pyopencl.array as cla

ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx, 
                         properties=cl.command_queue_properties.PROFILING_ENABLE)

Choose platform:
[0] <pyopencl.Platform 'Portable Computing Language' at 0x7f0ef1caf008>
[1] <pyopencl.Platform 'NVIDIA CUDA' at 0x2114490>
[2] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7f0ee9c50f30>
[3] <pyopencl.Platform 'Intel(R) OpenCL' at 0x2122b40>
Choice [0]:2
Set the environment variable PYOPENCL_CTX='2' to avoid being asked again.


In [3]:
# Array creation
shape = (10000000,)
a = np.random.random(shape).astype(np.float32)
b = np.random.random(shape).astype(np.float32)

#Reference result
ref = a+b 

# Send data to the device and prepare output buffer
a_d = cla.to_device(queue, a)
b_d = cla.to_device(queue, b)
res_d = cla.empty_like(a_d)

In [4]:
%%cl_kernel

kernel void add(global float* a,
                global float* b,
                global float* res){
    
    int idx = get_global_id(0);
    res[idx] = a[idx] + b[idx];
}

In [5]:
evt = add(queue, shape, None,
          a_d.data, b_d.data, res_d.data)
print(evt)

<pyopencl._cl.Event object at 0x7f0ef50cb8b0>


In [6]:
np.allclose(ref, res_d.get())

True
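Outside a notebook the `%%cl_kernel` magic is not available, and the kernel has to be compiled explicitly. The sketch below shows one way to do this with `cl.Program(...).build()`; the helper name `run_add` and the constant `KERNEL_SRC` are illustrative, and running the function requires a working OpenCL driver.

```python
KERNEL_SRC = """
kernel void add(global const float* a,
                global const float* b,
                global float* res)
{
    int idx = get_global_id(0);
    res[idx] = a[idx] + b[idx];
}
"""


def run_add(a, b):
    """Compile KERNEL_SRC at runtime and add two float32 arrays on a device."""
    # Imported lazily so that this module loads even without an OpenCL driver.
    import pyopencl as cl
    import pyopencl.array as cla

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    program = cl.Program(ctx, KERNEL_SRC).build()  # runtime compilation

    a_d = cla.to_device(queue, a)
    b_d = cla.to_device(queue, b)
    res_d = cla.empty_like(a_d)

    # One work-item per element; None lets the driver pick the work-group size.
    evt = program.add(queue, a.shape, None, a_d.data, b_d.data, res_d.data)
    evt.wait()
    return res_d.get()
```

Calling `run_add(a, b)` with the `float32` arrays defined above should give the same values as `a + b`.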

In [7]:
print(f"Execution time on GPU: {1e-6*(evt.profile.end-evt.profile.start):.3f} ms\nExecution time on CPU:")

%timeit a+b

Execution time on GPU: 1.690 ms
Execution time on CPU:
7.48 ms ± 76.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
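These timings can be turned into an effective memory bandwidth, which is the relevant figure for a kernel that does one addition per 12 bytes moved. The sketch below simply redoes the arithmetic with the numbers printed above. Note that the profiling event counts nanoseconds (hence the `1e-6` factor to get milliseconds) and covers the kernel execution only, not the host-to-device transfers.

```python
n_elements = 10_000_000   # shape used above
bytes_per_element = 4     # float32
n_arrays = 3              # read a, read b, write res

bytes_moved = n_elements * bytes_per_element * n_arrays  # 120 MB in total

gpu_time = 1.690e-3  # s, kernel time reported by the profiling event
cpu_time = 7.48e-3   # s, mean reported by %timeit

gpu_bandwidth = bytes_moved / gpu_time / 1e9  # GB/s
cpu_bandwidth = bytes_moved / cpu_time / 1e9  # GB/s

print(f"GPU: {gpu_bandwidth:.1f} GB/s")
print(f"CPU: {cpu_bandwidth:.1f} GB/s")
print(f"Speed-up: {cpu_time / gpu_time:.1f}x")
```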


## Parallel programming design patterns

* Map: element-wise (pixel-wise) operations, like the `add` kernel.
* Gather: stencil-like operations, for example convolutions.
* Scatter: write at a variable position, which requires atomic operations.
* Reduction: combine all elements of an array into one value, for example their sum.
* Scan: perform the `cumsum`, the sum of all previous elements; used in compaction.
* Sort: bitonic sort is one example of a parallel sort.
* ...

Besides `Map`, all these kernels require dozens to hundreds of lines of code to implement the algorithm!

PyOpenCL provides templates for all those algorithms, making the programmer's life simpler.
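The difference between `Gather` and `Scatter` is easiest to see side by side. The following sequential Python sketch (function names are illustrative) emulates one work-item per loop iteration: the gather reads several inputs per output, while the scatter writes to a data-dependent position.

```python
def gather_stencil(src):
    """Gather: each output index reads several inputs (3-point average)."""
    n = len(src)
    out = [0.0] * n
    for i in range(n):                      # one work-item per output
        left = src[max(i - 1, 0)]           # clamp at the borders
        right = src[min(i + 1, n - 1)]
        out[i] = (left + src[i] + right) / 3.0
    return out


def scatter_histogram(values, n_bins):
    """Scatter: each input index writes to a data-dependent position."""
    hist = [0] * n_bins
    for i in range(len(values)):            # one work-item per input
        bin_idx = values[i] % n_bins
        # Several work-items may hit the same bin: on a device this
        # increment must be atomic (e.g. atomic_inc in OpenCL C).
        hist[bin_idx] += 1
    return hist


print(gather_stencil([0.0, 3.0, 6.0, 9.0]))
print(scatter_histogram([0, 1, 1, 3, 5, 5, 5], n_bins=4))
```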

### Generating `Map` kernels

In [8]:
from pyopencl.elementwise import ElementwiseKernel

t_add = ElementwiseKernel(ctx, 
                          arguments="float* a, float* b, float* res", 
                          operation="res[i] = a[i] + b[i]")

#reset the destination array:
res_d.fill(0)

t_add(a_d, b_d, res_d)
np.allclose(ref, res_d.get())

True
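Conceptually, this kind of metaprogramming amounts to pasting the user-supplied fragments into a kernel skeleton before compiling it. The sketch below illustrates the idea with plain string formatting; the template and the helper `make_elementwise_source` are hypothetical and are not the actual template used by PyOpenCL.

```python
ELEMENTWISE_TEMPLATE = """
kernel void {name}({arguments}, long n)
{{
    for (long i = get_global_id(0); i < n; i += get_global_size(0)) {{
        {operation};
    }}
}}
"""


def make_elementwise_source(name, arguments, operation):
    """Hypothetical helper: fill the skeleton with user-supplied fragments."""
    # Kernel pointer arguments live in global memory, so qualify each one.
    qualified = ", ".join("global " + arg.strip() for arg in arguments.split(","))
    return ELEMENTWISE_TEMPLATE.format(
        name=name, arguments=qualified, operation=operation
    )


source = make_elementwise_source(
    name="t_add",
    arguments="float* a, float* b, float* res",
    operation="res[i] = a[i] + b[i]",
)
print(source)
```

The generated string is ordinary OpenCL C, so it could be handed to the runtime compiler like any hand-written kernel.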

In [9]:
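# Even more trivial: array arithmetic generates the kernel for you.
# Note: the call is asynchronous, so %time below mostly measures the enqueue time.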
c_d = a_d + b_d

%time a_d + b_d

type(c_d)

CPU times: user 66 µs, sys: 67 µs, total: 133 µs
Wall time: 138 µs


pyopencl.array.Array

### Reduction kernel, like the sum of all elements in an array:
![Reduction kernel](kernel-code-sum-reduction.png)

One can use it to perform the scalar product ...

In [10]:
# Dot product implemented as reduction kernel
from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, 
                      dtype_out=np.float32, 
                      neutral="0",
                      reduce_expr="a + b", 
                      map_expr="a[i] * b[i]",
                      arguments="__global float* a, __global float* b")

np.isclose(dot(a_d, b_d).get(), np.dot(a,b.astype(np.float64)))

True
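The roles of `neutral`, `map_expr` and `reduce_expr` can be emulated in plain Python. This is a simplified sketch of a tree-shaped reduction (the function name `tree_reduce` is illustrative), not the code PyOpenCL generates.

```python
def tree_reduce(values, neutral, map_func, reduce_func):
    """Emulate a parallel reduction: map, then combine pairs level by level."""
    work = [map_func(v) for v in values]     # map_expr, one item per element
    # Pad with the neutral element up to a power of two so every level pairs up.
    size = 1
    while size < len(work):
        size *= 2
    work += [neutral] * (size - len(work))

    stride = size // 2
    while stride > 0:
        # All iterations of this inner loop are independent: on a device
        # they run in parallel, with a barrier between two levels.
        for i in range(stride):
            work[i] = reduce_func(work[i], work[i + stride])
        stride //= 2
    return work[0]


vec_a = [1.0, 2.0, 3.0, 4.0, 5.0]
vec_b = [2.0, 2.0, 2.0, 2.0, 2.0]
dot_product = tree_reduce(list(zip(vec_a, vec_b)), neutral=0.0,
                          map_func=lambda pair: pair[0] * pair[1],
                          reduce_func=lambda u, v: u + v)
print(dot_product)
```

The number of levels grows as log2 of the array size, which is why a reduction parallelises well even though the result is a single value.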

In [11]:
%timeit dot(a_d, b_d).get()
a64 = a.astype(np.float64)
b64 = b.astype(np.float64)
%timeit np.dot(a64,b64)

699 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.61 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Scan kernel, the sum of all previous elements of the array:
![Scan kernel](300px-Prefix_sum_16.svg.png)

Its typical application is in compaction or compression algorithms.
In `numpy` this is implemented by the `cumsum` function:

In [12]:
from pyopencl.scan import GenericScanKernel
cumsum = GenericScanKernel(ctx, 
                           np.float32,
                           arguments="__global float* ary, __global float* out",
                           input_expr="ary[i]",
                           scan_expr="a+b", 
                           neutral="0",
                           output_statement="out[i] = item;")

cumsum(a_d, res_d)
np.allclose(res_d.get(),np.cumsum(a64))

True

In [13]:
%timeit cumsum(a_d, res_d).wait()
%timeit np.cumsum(a64)

1.03 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
40 ms ± 99.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
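The speed-up may seem surprising, since a cumulative sum looks inherently sequential. One classic parallel formulation is the Hillis-Steele scan, sketched below in sequential Python; it is shown only to illustrate the principle and is not necessarily the algorithm used by `GenericScanKernel`.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum in ceil(log2(n)) parallel steps."""
    current = list(values)
    n = len(current)
    offset = 1
    while offset < n:
        # Each element i >= offset adds the value `offset` places to its left.
        # Reading from `current` and writing to `nxt` mimics the double
        # buffering needed when all work-items update simultaneously.
        nxt = list(current)
        for i in range(offset, n):
            nxt[i] = current[i] + current[i - offset]
        current = nxt
        offset *= 2
    return current


print(hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8]))
```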


`Scan` kernels are at the core of **memory compaction** but also of **compression** algorithms, as demonstrated in:
https://doi.org/10.1107/S1600577518000607
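To see why a scan enables compaction, consider removing the zeros from an array: an exclusive scan over the keep-flags gives every surviving element its position in the output, so all writes can happen in parallel without collisions. The sketch below is a sequential Python emulation with illustrative function names.

```python
def exclusive_scan(values):
    """Sum of all elements strictly before each position."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(values, keep):
    """Keep the elements for which `keep` is true, preserving their order."""
    flags = [1 if keep(v) else 0 for v in values]   # map
    positions = exclusive_scan(flags)               # scan
    n_kept = positions[-1] + flags[-1] if values else 0
    out = [None] * n_kept
    for i, v in enumerate(values):                  # scatter, no collisions
        if flags[i]:
            out[positions[i]] = v
    return out


data = [0, 7, 0, 0, 3, 9, 0, 1]
print(compact(data, keep=lambda v: v != 0))
```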

## Conclusion

* PyOpenCL is an interesting Python binding for doing GPU programming:
  - Platform independent (i.e. without Nvidia lock-in)
  - Comfort of Python and Jupyter
  - Full control of execution and memory management
  - Scales to larger projects (`silx`, `pyFAI`)
  - Can also exploit manycore CPUs
  - Great for continuous integration (with Intel driver)
  - A fully open-source driver exists (PoCL, the Portable Computing Language)
* Parallel programming design patterns are well documented
  - Knowing them allows one to address performance issues with the proper tool
  - Most of them are already implemented in PyOpenCL/PyCuda via metaprogramming
* Kudos to Andreas Kloeckner, author of PyOpenCL