# GPU programming using PyOpenCL, part 5: Scan

This is a set of exercises on the usage of PyOpenCL. There are 6 main classes of parallel algorithms:
1. Map or element-wise kernels: 1 thread calculates 1 result from 1 input position
2. Gather: 1 thread calculates 1 result from several input elements; a typical example is the convolution
3. Scatter: 1 thread uses 1 input element and scatters it onto one or several output pixels; this requires the use of atomic operations
4. Reduction: apply the same associative operation to all elements of an ensemble, for example the sum of all elements in a list
5. Scan: also called *prefix sum*, this algorithm applies the same associative operation to all *previous* elements of a list, for example a cumulative sum (cumsum)
6. Sort: using a sorting network like the bitonic sort

This fifth tutorial focuses on the **Scan** operation, where one applies the same associative operation to each element and all the elements preceding it in the input array.
The result is an array with as many elements as the input array.
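
As an illustration of this definition, here is a minimal sequential sketch in plain Python. The helper name `inclusive_scan` is hypothetical (it is not part of PyOpenCL) and the function is only meant as a reference to compare against, not as a parallel implementation.

```python
import numpy as np

def inclusive_scan(values, op, neutral):
    """Sequential reference: out[i] = op(values[0], ..., values[i])."""
    out = []
    acc = neutral  # neutral element of `op` (0 for a sum)
    for v in values:
        acc = op(acc, v)
        out.append(acc)
    return out

data = [3, 1, 4, 1, 5]
# Cumulative sum: the associative operation is the addition
cumsum = inclusive_scan(data, lambda a, b: a + b, 0)
# Cumulative maximum: any associative operation can be used
cummax = inclusive_scan(data, max, float("-inf"))

print(cumsum)  # [3, 4, 8, 9, 14], same as np.cumsum(data)
print(cummax)  # [3, 3, 4, 4, 5]
```

The output has the same length as the input, and the last element of the scan equals the result of the corresponding reduction.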





Two main algorithms exist:

* [Hillis and Steele](https://en.wikipedia.org/wiki/Prefix_sum?Parallel%20algorithms#Algorithm_1:_Shorter_span,_more_parallel)

 ![Hillis and Steele](Hillis_Steele.svg)



* [Blelloch](https://en.wikipedia.org/wiki/Prefix_sum?Parallel%20algorithms#Algorithm_2:_Work-efficient)

 ![Blelloch](Prefix_sum_16.svg)
  

The latter algorithm is similar to the reduction algorithm applied twice. Due to the limited time, this algorithm will only be demonstrated using metaprogramming.
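
To make the two access patterns concrete, the sketch below emulates both algorithms with NumPy on the CPU. Each pass of a `while` loop stands for one parallel step in which all elements would be processed by different threads. The function names `hillis_steele_scan` and `blelloch_scan` are hypothetical, and this is only an illustration of the data flow, not the code generated by PyOpenCL.

```python
import numpy as np

def hillis_steele_scan(x):
    """Inclusive scan in log2(n) steps, each step touching all elements."""
    out = np.array(x, copy=True)
    step = 1
    while step < out.size:
        nxt = out.copy()  # double buffering: read old values, write new ones
        nxt[step:] = out[step:] + out[:-step]
        out = nxt
        step *= 2
    return out

def blelloch_scan(x):
    """Work-efficient scan: an up-sweep (reduction) then a down-sweep."""
    x = np.asarray(x)
    n = x.size
    size = 1
    while size < n:  # pad to the next power of two with the neutral element
        size *= 2
    tree = np.zeros(size, dtype=x.dtype)
    tree[:n] = x
    step = 1
    while step < size:  # up-sweep: partial sums, like a reduction
        tree[2 * step - 1::2 * step] += tree[step - 1::2 * step]
        step *= 2
    tree[-1] = 0  # the root receives the neutral element
    step = size // 2
    while step >= 1:  # down-sweep: distribute the partial sums
        left = tree[step - 1::2 * step].copy()
        tree[step - 1::2 * step] = tree[2 * step - 1::2 * step]
        tree[2 * step - 1::2 * step] += left
        step //= 2
    # The down-sweep yields an exclusive scan; add the input to make it inclusive
    return tree[:n] + x

values = np.array([1, 2, 3, 4])
print(hillis_steele_scan(values))  # [ 1  3  6 10]
print(blelloch_scan(values))       # [ 1  3  6 10]
```

The Hillis and Steele version performs about n·log2(n) additions, while the Blelloch version performs about 2n of them at the cost of twice as many steps.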

In [None]:
import numpy as np
import pyopencl as cl
from pyopencl import array as cla
ctx = cl.create_some_context(interactive=False)
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

size = 1000
a = np.random.randint(0, 100, size=size).astype("int32")
a_d = cla.to_device(queue, a)
out_d = cla.empty_like(a_d)

print(ctx.devices[0])
# %load_ext pyopencl.ipython_ext

In [None]:
from pyopencl.scan import GenericScanKernel
cumsum = GenericScanKernel(ctx, np.int32,
                            arguments="__global int *ary, __global int *out",
                            input_expr="ary[i]",
                            scan_expr="a+b", 
                            neutral="0",
                            output_statement="out[i] = item;")

cumsum(a_d, out_d)
assert np.allclose(out_d.get(), np.cumsum(a))

## Now it is your turn: Sparsification kernel

Write a sparsification kernel that takes one input and two outputs: *index* and *data*.
The input array has many zeros; if the value of an element is non-zero, save its position in the *index* array and its value in the *data* array.
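
One possible way to approach this exercise is to scan a 0/1 flag array: the inclusive scan of the flags gives, for every non-zero element, its rank in the compacted output. The NumPy sketch below illustrates the idea on a tiny array (the variable names are hypothetical and the loop stands for independent threads).

```python
import numpy as np

dense = np.array([0, 7, 0, 0, 3, 0, 9, 0], dtype=np.int32)

# 1 where the element has to be kept, 0 elsewhere
flags = (dense != 0).astype(np.int32)
# Inclusive scan of the flags: rank (starting at 1) of each kept element
positions = np.cumsum(flags)
nnz = int(positions[-1])  # the last value of the scan is the number of non-zeros

index = np.zeros(nnz, dtype=np.int32)
data = np.zeros(nnz, dtype=np.int32)
for i in range(dense.size):  # every iteration is independent of the others
    if flags[i]:
        index[positions[i] - 1] = i
        data[positions[i] - 1] = dense[i]

print(index)  # [1 4 6]
print(data)   # [7 3 9]
```

Since every kept element writes to a distinct output slot, no atomic operation is needed.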

In [None]:
size = 1<<24
nnz = 10
all_idx = np.arange(size)
np.random.shuffle(all_idx)
index = all_idx[:nnz]
index.sort()
values = np.random.poisson(10, size=nnz)
dense = np.zeros(size, dtype="int32")
dense[index] = values

In [None]:
dense_d = cla.to_device(queue, dense)
index_d = cla.zeros_like(dense_d)
values_d = cla.zeros_like(dense_d)

sparsify = GenericScanKernel(ctx, np.int32,
                             arguments="__global int *dense, __global int *index, __global int *values",
                             input_expr="(dense[i] != 0) ? 1 : 0",
                             scan_expr="a+b", 
                             neutral="0",
                             output_statement="if (prev_item != item) {index[item-1] = i; values[item-1] = dense[i];}")

evt = sparsify(dense_d, index_d, values_d)
evt.wait()
print(f"Profile time: {(evt.profile.end-evt.profile.start)*1e-6:.3f}ms")
assert np.allclose(index, index_d.get()[:nnz])
assert np.allclose(values_d.get()[:nnz], values)

## More difficult exercise:

Write a byte-offset algorithm that subtracts the previous value from the current one and stores the resulting difference on:
* 1 byte if the difference is between -127 and 127
* 3 bytes (the marker -128 on 1 byte, then the difference on 2 bytes, little endian) if the difference is between -32767 and 32767
* 7 bytes (the marker -128 on 1 byte, the marker -32768 on 2 bytes, then the difference on 4 bytes, little endian) if the difference is larger.

This algorithm is used by Pilatus detectors to compress the data before saving; it offers a compression factor close to 4 on 32-bit data when most differences fit in a single byte.
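
A possible link with the scan operation is sketched below in NumPy: once the encoded size of every element is known (1, 3 or 7 bytes), an exclusive scan of those sizes gives the position where each element starts in the output stream. The function name `byte_offset_layout` is hypothetical, and the sketch assumes a non-empty input whose differences fit in 32 bits.

```python
import numpy as np

def byte_offset_layout(data):
    """Return the differences, their encoded sizes and their output offsets.

    Assumes a non-empty input whose differences fit in 32 bits.
    """
    flat = np.ascontiguousarray(data, dtype=np.int64).ravel()
    delta = np.empty_like(flat)
    delta[0] = flat[0]
    delta[1:] = flat[1:] - flat[:-1]
    absdelta = np.abs(delta)
    sizes = np.ones(flat.size, dtype=np.int64)  # default: 1 byte
    sizes[absdelta > 127] = 3      # marker + difference on 2 bytes
    sizes[absdelta > 32767] = 7    # 2 markers + difference on 4 bytes
    ends = np.cumsum(sizes)        # inclusive scan of the sizes
    starts = ends - sizes          # exclusive scan: first byte of each element
    return delta, sizes, starts

delta, sizes, starts = byte_offset_layout(np.array([10, 12, 300, 299, 100000, 100001]))
print(delta)   # [   10     2   288    -1 99701     1]
print(sizes)   # [1 1 3 1 7 1]
print(starts)  # [ 0  1  2  5  6 13]
```

The sum of `sizes` (14 bytes here) is the length of the compressed stream, and each element can then be written independently at its own offset.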

In [None]:
import numpy as np
size = 100
ary = np.random.poisson(100, size=size)

In [None]:
import numpy
def compByteOffset_numpy(data):
    """
    Compress a dataset using the byte_offset algorithm

    :param data: ndarray
    :return: 1D int8 ndarray holding the compressed byte stream

    test = numpy.array([0,1,2,127,0,1,2,128,0,1,2,32767,0,1,2,32768,0,1,2,2147483647,0,1,2,2147483648,0,1,2,128,129,130,32767,32768,128,129,130,32768,2147483647,2147483648])

    """
    flat = numpy.ascontiguousarray(data.ravel(), numpy.int64)
    delta = numpy.zeros_like(flat)
    delta[0] = flat[0]
    delta[1:] = flat[1:] - flat[:-1]
    mask = abs(delta) > 127
    exceptions = numpy.nonzero(mask)[0]
    # Exceptions are stored little endian: swap bytes on big-endian hosts
    byteswap = not numpy.little_endian
    start = 0
    binary_blob = b""
    for stop in exceptions:
        if stop - start > 0:
            binary_blob += delta[start:stop].astype(numpy.int8).tobytes()
        exc = delta[stop]
        absexc = abs(exc)
        if absexc > 2147483647:  # 2**31-1: markers followed by the difference on 8 bytes
            binary_blob += b"\x80\x00\x80\x00\x00\x00\x80"
            if byteswap:
                binary_blob += delta[stop:stop + 1].byteswap().tobytes()
            else:
                binary_blob += delta[stop:stop + 1].tobytes()
        elif absexc > 32767:  # 2**15-1
            binary_blob += b"\x80\x00\x80"
            if byteswap:
                binary_blob += delta[stop:stop + 1].astype(numpy.int32).byteswap().tobytes()
            else:
                binary_blob += delta[stop:stop + 1].astype(numpy.int32).tobytes()
        else:  # >127
            binary_blob += b"\x80"
            if byteswap:
                binary_blob += delta[stop:stop + 1].astype(numpy.int16).byteswap().tobytes()
            else:
                binary_blob += delta[stop:stop + 1].astype(numpy.int16).tobytes()
        start = stop + 1
    if start < delta.size:
        binary_blob += delta[start:].astype(numpy.int8).tobytes()
    return numpy.frombuffer(binary_blob, "int8")

In [None]:
ref = compByteOffset_numpy(ary)

In [None]:
ref