GPU Data Frame: Technical Overview
The basic approach for the GPU Data Frame (GDF) is simple: if applications and libraries agree on an in-memory format for tabular data and its associated metadata, then only a device pointer to the data structure needs to be exchanged. Additionally, the IPC mechanism built into the CUDA driver allows device pointers to be shared between processes.
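As a CPU-side analogue of this idea (device pointers and CUDA IPC are elided), the sketch below shows a producer and consumer that share only a pointer plus a small agreed-upon descriptor. The `ColumnDesc` struct is purely illustrative, not the actual GDF/Arrow layout:

```python
import ctypes

# Hypothetical column descriptor -- illustrative only, NOT the real GDF layout.
class ColumnDesc(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),   # pointer to the column's buffer
        ("size", ctypes.c_size_t),   # number of elements
        ("dtype", ctypes.c_int),     # agreed-upon type code (0 = int32)
    ]

def producer():
    # Producer allocates and fills a buffer, then publishes only the descriptor.
    buf = (ctypes.c_int32 * 4)(10, 20, 30, 40)
    desc = ColumnDesc(ctypes.cast(buf, ctypes.c_void_p), 4, 0)
    return desc, buf  # the buffer must stay alive alongside the descriptor

def consumer(desc):
    # Consumer reconstructs a typed view from the raw pointer + metadata alone.
    ptr = ctypes.cast(ctypes.c_void_p(desc.data), ctypes.POINTER(ctypes.c_int32))
    return [ptr[i] for i in range(desc.size)]

desc, _buf = producer()
print(consumer(desc))  # -> [10, 20, 30, 40]
```

The consumer never sees the producer's variables, only the pointer and metadata, which is exactly the contract the shared in-memory format provides.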
Currently, the GDF format is a subset of the Apache Arrow specification. The precise subset has not been fully defined yet; it includes numerical columns today, and will soon include dictionary-encoded columns (called "categorical" columns in some other data frame systems).
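Dictionary encoding stores a column as a table of distinct values plus an integer code per row. A minimal plain-Python illustration of the idea (no Arrow types involved):

```python
def dictionary_encode(values):
    """Encode a column as (categories, codes): each value becomes an
    integer index into the list of distinct values, in first-seen order."""
    categories, codes = [], []
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return categories, codes

def dictionary_decode(categories, codes):
    # Decoding is a simple gather of categories by code.
    return [categories[c] for c in codes]

cats, codes = dictionary_encode(["red", "blue", "red", "green", "blue"])
print(cats)   # -> ['red', 'blue', 'green']
print(codes)  # -> [0, 1, 0, 2, 1]
```

For low-cardinality string columns this keeps the per-row data as small fixed-width integers, which is what makes the representation GPU-friendly.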
Fundamentally, one can implement GDF support by following the Arrow specification directly, and in some cases that is the easiest approach. However, there are common operations that we expect many GDF-supporting applications will need. To help jumpstart other GDF users, we are developing several layers of GDF functionality that can be reused in other projects:
[Much of this functionality is still in progress...]
libgdf: A C library of helper functions, including:
- Copying the GDF metadata block to the host and parsing it into a host-side struct (typically needed for function dispatch).
- Importing/exporting a GDF using the CUDA IPC mechanism.
- CUDA kernels to perform element-wise math operations on GDF columns.
- CUDA sort, join, and reduction operations on GDFs.
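Arrow represents missing values with a validity bitmap: bit i of the bitmap (least-significant bit first) is set when element i is valid. As a host-side illustration of the element-wise kernels above, the sketch below adds two columns and ANDs their bitmaps; this is plain Python for clarity, not the libgdf API:

```python
def elementwise_add(a, b, valid_a, valid_b):
    """Element-wise add of two columns with Arrow-style validity bitmaps
    (bit i of byte i // 8, least-significant bit first; 1 = valid).
    Output elements are valid only where both inputs are valid."""
    n = len(a)
    out = [0] * n
    out_valid = bytearray((n + 7) // 8)
    for i in range(n):
        byte, bit = i // 8, i % 8
        if (valid_a[byte] >> bit) & 1 and (valid_b[byte] >> bit) & 1:
            out[i] = a[i] + b[i]
            out_valid[byte] |= 1 << bit
    return out, bytes(out_valid)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out, valid = elementwise_add(a, b, bytes([0b1011]), bytes([0b1110]))
print(out)  # -> [0, 22, 0, 44]  (elements 0 and 2 are null in one input)
```

On the GPU the per-element body of the loop becomes the CUDA kernel, with one thread per element.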
pygdf: A Python library for manipulating GDFs, including:
- Creating GDFs from NumPy arrays and pandas DataFrames
- Performing math operations on columns
- Import/export via CUDA IPC
- Sort, join, reductions
- JIT compilation of group by and filter kernels using Numba
dask_gdf: A Python library extending Dask for distributed GDF computation:
- The same operations as pygdf, but operating on GDFs chunked across multiple GPUs and multiple servers.
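The chunked model above amounts to a map-reduce: each GPU (or server) reduces its local chunk to a small partial result, and only those partials cross the interconnect to be combined. A plain-Python sketch with no Dask or GPU dependencies:

```python
def chunked_mean(chunks):
    """Mean of a column split across chunks: each worker reduces its local
    chunk to a (sum, count) pair, and only these small partials are combined."""
    partials = [(sum(c), len(c)) for c in chunks]  # per-chunk (worker-local) step
    total, count = map(sum, zip(*partials))        # cheap cross-worker combine
    return total / count

# One inner list per GPU/server partition.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(chunked_mean(chunks))  # -> 5.0
```

Operations like join and sort need more elaborate partitioning schemes, but follow the same pattern of local work plus a small exchange step.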