# PYGDF Groupby Usage

In [1]:
import numpy as np
from pygdf import DataFrame

Make a GPU dataframe with random data to showcase the different ways we can use groupby.

In [2]:
df = DataFrame()
nelem = 10**6  # one million rows
df['key1'] = np.random.randint(0, 5, nelem)
df['key2'] = np.random.randint(0, 3, nelem)
df['val1'] = np.arange(1, 1 + nelem)
df['val2'] = np.random.random(nelem)

In [3]:
df.head().to_pandas()

Unnamed: 0,key1,key2,val1,val2
0,3,1,1,0.603196
1,1,1,2,0.893075
2,3,0,3,0.239114
3,1,0,4,0.673365
4,1,1,5,0.402004


## Groupby reduce

The simplest usage of groupby is to apply a reduction to combine rows in the same group.

Here we use the columns `key1` and `key2` as the grouping key and get the mean of `val1` and `val2` for each group.

In [4]:
group_agg = df.groupby(by=['key1', 'key2']).mean()
group_agg.head(10).to_pandas()

Unnamed: 0,key1,key2,val1,val2
0,0,0,500595.741195,0.499919
1,0,1,498910.45639,0.499579
2,0,2,499876.631922,0.499918
3,1,0,501949.391134,0.500343
4,1,1,497984.556625,0.500716
5,1,2,500338.003388,0.498666
6,2,0,500047.012978,0.499611
7,2,1,499308.462004,0.498986
8,2,2,500457.22765,0.500878
9,3,0,501706.369686,0.502233


Sometimes we want to apply a different reduction to each column, or more than one reduction to the same column.

Here, we will get the count and mean of `val1` and the standard deviation of `val2` for each group.

In [5]:
group_agg = df.groupby(by=['key1', 'key2']).agg({'val1': ['count', 'mean'], 'val2': ['std']})
group_agg.head(10).to_pandas()

Unnamed: 0,key1,key2,val1_count,val1_mean,val2
0,0,0,66525.0,500595.741195,0.499919
1,0,1,66682.0,498910.45639,0.499579
2,0,2,66706.0,499876.631922,0.499918
3,1,0,66453.0,501949.391134,0.500343
4,1,1,66508.0,497984.556625,0.500716
5,1,2,67003.0,500338.003388,0.498666
6,2,0,66498.0,500047.012978,0.499611
7,2,1,66954.0,499308.462004,0.498986
8,2,2,66699.0,500457.22765,0.500878
9,3,0,66662.0,501706.369686,0.502233
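To make the per-column reductions concrete, here is a minimal plain-Python sketch of what a dictionary-style aggregation computes for each group. It does not use pygdf; the tiny table and the `val2_std` output name are illustrative, and using the sample standard deviation (the pandas default) is an assumption about the convention.

```python
from collections import defaultdict
from statistics import mean, stdev

# Tiny hand-written table (illustrative values, not taken from the run above).
rows = [
    # (key1, key2, val1, val2)
    (0, 0, 1, 0.5),
    (0, 0, 3, 1.5),
    (0, 1, 2, 0.25),
    (0, 1, 4, 0.75),
    (0, 1, 6, 1.25),
]

# Bucket the value columns by the (key1, key2) grouping key.
groups = defaultdict(lambda: {"val1": [], "val2": []})
for key1, key2, val1, val2 in rows:
    groups[(key1, key2)]["val1"].append(val1)
    groups[(key1, key2)]["val2"].append(val2)

# One output row per group: count and mean of val1, sample std of val2.
# Assumption: sample standard deviation (ddof=1), as in pandas.
result = {
    key: {
        "val1_count": len(cols["val1"]),
        "val1_mean": mean(cols["val1"]),
        "val2_std": stdev(cols["val2"]),
    }
    for key, cols in sorted(groups.items())
}

for key, stats in result.items():
    print(key, stats)
```

Each group contributes exactly one output row, and each requested reduction becomes one output column.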


## Groupby Apply

For more advanced usage, we can use `groupby.apply` to transform each group using a Python function.

Here, we will add two columns: `val1_plus_val2`, which is the sum of `val1` and `val2`, and `intra_group_id`, which is the serial ID of each row within its group.

In [6]:
def transform(group_df):
    # `group_df` is a GPU DataFrame for a group
    group_df['val1_plus_val2'] = group_df['val1'] + group_df['val2']
    group_df['intra_group_id'] = np.arange(len(group_df))
    return group_df

group_appl = df.groupby(by=['key1', 'key2']).apply(transform)
group_appl.head(20).to_pandas()

Unnamed: 0,key1,key2,val1,val2,val1_plus_val2,intra_group_id
0,0,0,43,0.707849,43.707849,0
1,0,0,66,0.202822,66.202822,1
2,0,0,86,0.280125,86.280125,2
3,0,0,125,0.507459,125.507459,3
4,0,0,141,0.145419,141.145419,4
5,0,0,142,0.73596,142.73596,5
6,0,0,143,0.69384,143.69384,6
7,0,0,172,0.081564,172.081564,7
8,0,0,192,0.175048,192.175048,8
9,0,0,218,0.475765,218.475765,9


### Simple GPU Transformation 

For better performance, we can use `groupby.apply_grouped` to apply a CUDA kernel over each group.

Here, `transform_kernel` is the transformation function that will be compiled into a CUDA kernel by **numba**.  The function contains a loop that applies the transformation over each row of the group.  Each group is computed concurrently on the GPU.

In [7]:
def transform_kernel(val1, val2, val1_plus_val2, intra_group_id):
    for i in range(val1.size):
        val1_plus_val2[i] = val1[i] + val2[i]
        intra_group_id[i] = i
    
    
group_appl = df.groupby(by=['key1', 'key2']).apply_grouped(
    # the GPU kernel
    transform_kernel,
    # list of input columns
    incols=['val1', 'val2'],  
    # list of output columns and their dtype
    outcols={'val1_plus_val2': np.float64, 'intra_group_id': np.int32},  
)
group_appl.head(20).to_pandas()

Unnamed: 0,key1,key2,val1,val2,intra_group_id,val1_plus_val2
0,0,0,43,0.707849,0,43.707849
1,0,0,66,0.202822,1,66.202822
2,0,0,86,0.280125,2,86.280125
3,0,0,125,0.507459,3,125.507459
4,0,0,141,0.145419,4,141.145419
5,0,0,142,0.73596,5,142.73596
6,0,0,143,0.69384,6,143.69384
7,0,0,172,0.081564,7,172.081564
8,0,0,192,0.175048,8,192.175048
9,0,0,218,0.475765,9,218.475765
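As a mental model for the calling convention, the sketch below imitates the pattern on the CPU with plain lists: the rows are split by key, output buffers are allocated per group, and the kernel function is called once per group. `apply_grouped_cpu` and `toy_kernel` are hypothetical names for this illustration, and this is not how pygdf implements `apply_grouped`.

```python
from collections import defaultdict

def toy_kernel(val1, val2, val1_plus_val2, intra_group_id):
    # Same argument layout as transform_kernel: inputs first, then outputs.
    for i in range(len(val1)):
        val1_plus_val2[i] = val1[i] + val2[i]
        intra_group_id[i] = i

def apply_grouped_cpu(keys, val1, val2, kernel):
    """Hypothetical CPU stand-in: call `kernel` once per group of rows."""
    # Collect row positions per key, preserving the original row order.
    positions = defaultdict(list)
    for row, key in enumerate(keys):
        positions[key].append(row)

    out = []
    for key in sorted(positions):
        rows = positions[key]
        g_val1 = [val1[r] for r in rows]
        g_val2 = [val2[r] for r in rows]
        # The caller allocates the outputs, one slot per row in the group.
        g_sum = [0.0] * len(rows)
        g_id = [0] * len(rows)
        kernel(g_val1, g_val2, g_sum, g_id)
        out.extend(zip([key] * len(rows), g_val1, g_val2, g_sum, g_id))
    return out

keys = [1, 0, 1, 0, 1]
val1 = [10, 20, 30, 40, 50]
val2 = [0.5, 0.25, 0.5, 0.75, 0.5]
result = apply_grouped_cpu(keys, val1, val2, toy_kernel)
for row in result:
    print(row)
```

The kernel only ever sees the rows of a single group, which is why `intra_group_id` restarts at 0 for each key.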


### Advanced GPU Transformation

For readers familiar with CUDA programming: `.apply_grouped` launches `transform_kernel` so that each thread block processes one group.  By default, the thread block size is 1, which emulates serial behavior.  To make better use of the GPU cores, we can increase the number of threads per block and have the loop start at the thread's index and stride by the block size, so that the threads in a block share the rows of the group between them.  See below:

In [8]:
import math
from numba import cuda

def transform_kernel(key1, key2, val1, val2, sin1, cos2, intra_group_id):
    # This loop range uses all threads in the block: each thread starts at its
    # own index and strides by the block size, so the rows are shared between threads.
    for i in range(cuda.threadIdx.x, key1.size, cuda.blockDim.x):
        groupsize = key1.size
        sin1[i] = math.sin(val1[i] / groupsize) 
        cos2[i] = math.cos(val2[i]) 
        intra_group_id[i] = i
    
    
group_appl = df.groupby(by=['key1', 'key2']).apply_grouped(
    transform_kernel,
    incols=['key1', 'key2', 'val1', 'val2'],
    outcols={'sin1': np.float64, 'cos2': np.float64, 'intra_group_id': np.int32},
    # Set the number of threads per block for the GPU kernel.
    # It defaults to 1 to emulate serial behavior.
    tpb=64,
)
group_appl.head(20).to_pandas()

Unnamed: 0,key1,key2,val1,val2,cos2,intra_group_id,sin1
0,0,0,43,0.707849,0.759762,0,0.000646
1,0,0,66,0.202822,0.979502,1,0.000992
2,0,0,86,0.280125,0.961021,2,0.001293
3,0,0,125,0.507459,0.873982,3,0.001879
4,0,0,141,0.145419,0.989445,4,0.00212
5,0,0,142,0.73596,0.741187,5,0.002135
6,0,0,143,0.69384,0.768796,6,0.00215
7,0,0,172,0.081564,0.996676,7,0.002585
8,0,0,192,0.175048,0.984718,8,0.002886
9,0,0,218,0.475765,0.888943,9,0.003277
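The loop range `range(cuda.threadIdx.x, key1.size, cuda.blockDim.x)` can be checked without a GPU. The following sketch emulates it in plain Python for a hypothetical block of 4 threads and a group of 10 rows (both numbers are chosen for illustration), and shows which rows each thread would handle.

```python
def rows_for_thread(thread_idx, block_dim, group_size):
    """Rows visited by one thread under range(threadIdx.x, size, blockDim.x)."""
    return list(range(thread_idx, group_size, block_dim))

block_dim = 4     # stands in for tpb (illustrative value)
group_size = 10   # rows in one hypothetical group

assignment = {
    t: rows_for_thread(t, block_dim, group_size) for t in range(block_dim)
}
for t, rows in assignment.items():
    print(f"thread {t}: rows {rows}")

# Every row of the group is covered exactly once across the block.
covered = sorted(r for rows in assignment.values() for r in rows)
assert covered == list(range(group_size))
```

When the group has more rows than the block has threads, each thread handles several rows; when it has fewer, the surplus threads simply get an empty range.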


### Performance

Finally, we compare the GPU `groupby.apply_grouped` to doing the same operation in **pandas**.

Note: `.apply_grouped` is a new function in **pygdf** with no **pandas** counterpart, so on the pandas side we use `.apply` instead.

Time the execution of the GPU groupby transform operation:

In [9]:
%%timeit

group_appl = df.groupby(by=['key1', 'key2']).apply_grouped(
    transform_kernel,
    incols=['key1', 'key2', 'val1', 'val2'],
    outcols={'sin1': np.float64, 'cos2': np.float64, 'intra_group_id': np.int32},
    tpb=64,
)


72 ms ± 4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Make a pandas DataFrame:

In [10]:
pddf = df.to_pandas()
type(pddf)

pandas.core.frame.DataFrame

Time the execution of the pandas groupby transform operation:

In [11]:
%%timeit

def pd_transform_kernel(df):
    df['sin1'] = np.sin(df['val1'] / len(df))
    df['cos2'] = np.cos(df['val2'])
    df['intra_group_id'] = np.arange(len(df), dtype=np.int32)
    return df

out = pddf.groupby(by=['key1', 'key2']).apply(pd_transform_kernel)

370 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
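The two mean timings above can be turned into a speedup ratio. This small calculation uses only the numbers reported in this run; results will vary with hardware, data size, and library versions.

```python
gpu_ms = 72.0      # mean time reported for pygdf apply_grouped above
pandas_ms = 370.0  # mean time reported for pandas .apply above

speedup = pandas_ms / gpu_ms
print(f"apply_grouped is about {speedup:.1f}x faster in this run")
```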
