This notebook compares two implementations of argmax built on different algorithms: segmented_reduce and unary_transform. At the time of writing, unary_transform performs much better, since it uses one thread per segment, while segmented_reduce uses one block per segment, which leaves most of a block idle when segments are short.
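
For reference, here is a minimal CPU sketch of the operation both implementations compute, written with plain NumPy. The function name `argmax_per_segment` and the small arrays are illustrative assumptions, not part of the notebook; the loop body corresponds loosely to the work a single thread does for one segment in the one-thread-per-segment approach.

```python
import numpy as np

def argmax_per_segment(content, offsets):
    """CPU reference: local index of the maximum in each list, or -1 if the list is empty."""
    n_segments = len(offsets) - 1
    out = np.empty(n_segments, dtype=np.int64)
    for i in range(n_segments):
        start, stop = offsets[i], offsets[i + 1]
        # np.argmax raises on an empty slice, so empty lists are handled explicitly
        out[i] = np.argmax(content[start:stop]) if stop > start else -1
    return out

# Hypothetical data: three lists [0.2, 0.9, 0.5], [], [0.7, 0.1]
content = np.array([0.2, 0.9, 0.5, 0.7, 0.1])
offsets = np.array([0, 3, 3, 5])
print(argmax_per_segment(content, offsets).tolist())  # [1, -1, 0]
```

The `offsets` layout mirrors the one used below: list `i` spans `content[offsets[i]:offsets[i + 1]]`.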

In [1]:
import awkward as ak
import cupy as cp
import numpy as np

from cuda.compute import segmented_reduce, ZipIterator, gpu_struct, reduce_into, CountingIterator, unary_transform

### Using segmented_reduce

In [10]:
def cccl_argmax(awkward_array):
    @gpu_struct
    class ak_array:
        data: cp.float64
        local_index: cp.int64

    # Keep whichever (value, index) pair has the larger value
    def max_op(a: ak_array, b: ak_array):
        return a if a.data > b.data else b

    input_data = awkward_array.layout.content.data
    # Use ak.local_index to get each element's position within its list
    local_indices = ak.local_index(awkward_array, axis=1)
    local_indices = local_indices.layout.content.data

    # Pair each value with its local index without materializing a combined array
    input_struct = ZipIterator(input_data, local_indices)

    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]

    # Prepare the output array
    n_segments = start_o.size
    output = cp.zeros([n_segments], dtype=ak_array.dtype)

    # Initial value for the reduction: -inf never wins a comparison,
    # and index -1 marks an empty segment
    h_init = ak_array(-np.inf, -1)

    # Perform the segmented reduce
    segmented_reduce(
        input_struct, output, start_o, end_o, max_op, h_init, n_segments
    )

    return output
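
The pair-based reduction above can be modelled on the CPU to see how `max_op` behaves on ties. This is only a sketch: it uses a sequential left-to-right fold with `functools.reduce`, which is an assumption and says nothing about the order in which the GPU actually combines elements. `Pair`, `max_op_first`, and the sample values are hypothetical names and data introduced for illustration.

```python
from functools import reduce
from typing import NamedTuple

class Pair(NamedTuple):
    data: float
    local_index: int

def max_op(a, b):
    # Same comparison as in the notebook: on equal values, b is returned
    return a if a.data > b.data else b

def max_op_first(a, b):
    # Variant that keeps a on equal values
    return a if a.data >= b.data else b

# Identity for max; plays the role of h_init
init = Pair(float("-inf"), -1)

# Hypothetical segment with a tie between positions 1 and 2
segment = [Pair(0.3, 0), Pair(0.8, 1), Pair(0.8, 2)]

print(reduce(max_op, segment, init).local_index)        # 2 (last of the tied values)
print(reduce(max_op_first, segment, init).local_index)  # 1 (first of the tied values)
print(reduce(max_op, [], init).local_index)             # -1 (empty segment)
```

In this sequential model, the strict `>` returns the last of several equal maxima, whereas `np.argmax` returns the first. With random floating-point data, ties are unlikely, so the difference would rarely show up.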

In [None]:
awkward_array1 = ak.to_backend(ak.from_parquet("random_listoffset_small.parquet"), 'cuda')

Let's take a look at our array:

In [12]:
awkward_array1

In [13]:
result = cccl_argmax(awkward_array1)
result

array([(0.98264614, 0), (0.744157  , 0), (0.74000209, 0), ...,
       (0.81507469, 0), (0.56703317, 0), (0.61855135, 1)],
      shape=(10000000,), dtype={'names': ['data', 'local_index'], 'formats': ['<f8', '<i8'], 'offsets': [0, 8], 'itemsize': 16, 'aligned': True})
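
The output is a structured array of (data, local_index) pairs, so the argmax itself is the `local_index` field. The sketch below shows that extraction on a small hand-made NumPy structured array with the same field names and formats as the dtype printed above. It assumes the result has been copied to the host first; the names `pair_dtype` and `host_result` and the values are hypothetical, not taken from the notebook's data.

```python
import numpy as np

# Same field names and formats as the dtype shown in the output above
pair_dtype = np.dtype([("data", "<f8"), ("local_index", "<i8")])

# Hypothetical host-side result for three segments (the middle one empty)
host_result = np.array([(0.9, 1), (-1.0, -1), (0.7, 0)], dtype=pair_dtype)

# The argmax per segment is the index field of each pair
indices = host_result["local_index"]

# Compare against indices known to be right for this hand-made example
expected = np.array([1, -1, 0], dtype=np.int64)
print(np.array_equal(indices, expected))  # True
```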

The results are correct. Now let's run it a few more times and see how long it takes.

In [26]:
%%timeit -r 7 -n 100
result = cccl_argmax(awkward_array1)
cp.cuda.Device().synchronize()

71.1 ms ± 619 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Using unary_transform

In [17]:
def cccl_argmax_new(awkward_array):
    input_data = awkward_array.layout.content.data

    # Prepare the start and end offsets
    offsets = awkward_array.layout.offsets.data
    start_o = offsets[:-1]
    end_o = offsets[1:]

    # Prepare the output array
    n_segments = start_o.size
    output = cp.empty([n_segments], dtype=np.int64)

    def segment_reduce_op(segment_id: np.int64) -> np.int64:
        start_idx = start_o[segment_id]
        end_idx = end_o[segment_id]
        segment = input_data[start_idx:end_idx]
        if len(segment) == 0:
            return -1
        return np.argmax(segment)

    segment_ids = CountingIterator(np.int64(0))
    unary_transform(segment_ids, output, segment_reduce_op, n_segments)

    return output

In [22]:
# warmup run
result = cccl_argmax_new(awkward_array1)

In [19]:
result

array([0, 0, 0, ..., 0, 0, 1], shape=(10000000,))

Now let's time how long it takes on average:

In [25]:
%%timeit -r 7 -n 100
result = cccl_argmax_new(awkward_array1)
cp.cuda.Device().synchronize()

The slowest run took 5.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1.18 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Much better performance!

### Test performance with cuda-kernels (old implementation)

In [27]:
ak.argmax(awkward_array1, axis=1)  # warmup

In [28]:
%%timeit -r 7 -n 100
ak.argmax(awkward_array1, axis=1)
cp.cuda.Device().synchronize()

7.77 ms ± 845 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


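With CCCL, unary_transform is about 6.6 times faster than the existing cuda-kernels implementation (1.18 ms versus 7.77 ms) for a medium-sized array (280 MB in this example), whereas the segmented_reduce version is about 9 times slower (71.1 ms). The unary_transform timing is noisy (± 1.15 ms), so the ratio is approximate.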
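
The ratios implied by the three timings above can be worked out directly. The numbers below are the mean values copied from the timeit outputs in this notebook; the dictionary keys and variable names are illustrative only.

```python
# Mean time per loop in milliseconds, copied from the timeit outputs above
timings_ms = {
    "segmented_reduce": 71.1,
    "unary_transform": 1.18,
    "ak.argmax (cuda-kernels)": 7.77,
}

baseline = timings_ms["ak.argmax (cuda-kernels)"]

# Speedup relative to the existing kernels; values below 1 mean slower
speedup = {name: baseline / t for name, t in timings_ms.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x relative to ak.argmax")
```

Since the unary_transform measurement has a standard deviation close to its mean, these ratios are best read as rough estimates.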