Reusable Numba extension for CUDA target? #359
That's awesome! Composable list and struct types within a DataFrame-like context would really help out physicists (among other data analysts, I'm sure), especially if cuDF can also run without GPUs. (The ability to access data types in this way is useful in itself, even if users don't have access to compatible GPUs on their Macs or in CERN's computing farm.) This could also help the deprecation I'm considering in #350.

Just getting nested data types into Pandas hasn't been useful in itself, since the Pandas API doesn't have operations that know how to make use of them. Presumably, if you're adding these types to the DataFrame in a non-opaque way, then you'll also be adding operations that use them: for example, performing Cartesian products of nested lists or turning struct fields into columns and vice versa.

Will the buffer backing these new data types simply be an Arrow view? If so, then we can share more than Numba code: it would be possible to apply Awkward's array-at-a-time functions to columns of a cuDF DataFrame in much the same way that NumPy functions can be applied to a Pandas DataFrame. Awkward's own internal representation is more general than Arrow's (e.g. ListArray vs ListOffsetArray, a useful distinction when manipulating list structures), and it is zero-copy convertible wherever equivalent constructs exist (ak.to_arrow and ak.from_arrow).

As for the Numba code, it is located in src/awkward1/_connect/_numba. The data model is based on Awkward's node types, not Arrow's, and its primary focus is lightweight iteration. For that reason, we don't have Numba models for each node type (e.g. ListArray, ListOffsetArray): these objects may be created and destroyed frequently inside an iterative loop, Numba's models are pass-by-value by default (and it's not easy to make something pass-by-reference), and copying deep tree structures in every step of iteration would scale poorly. Instead, our Numba model is an ArrayView that walks over a Lookup data structure.
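To make the ListArray-vs-ListOffsetArray distinction concrete, here is a minimal pure-Python sketch (class names match the Awkward concepts, but the fields and methods are a hypothetical simplification, not Awkward's actual implementation): a ListOffsetArray has a single offsets buffer, so its lists are contiguous and in order, while a ListArray has independent starts and stops, so a slice or filter can reorder lists without touching the content buffer.

```python
# Illustrative sketch of the two list representations (hypothetical
# simplification of the Awkward Array concepts).

class ListOffsetArray:
    """Lists defined by one monotonically increasing offsets buffer."""
    def __init__(self, offsets, content):
        self.offsets, self.content = offsets, content

    def __getitem__(self, i):
        return self.content[self.offsets[i]:self.offsets[i + 1]]

class ListArray:
    """Lists defined by independent starts/stops: lists may overlap, be
    out of order, or skip content entirely -- useful when filtering."""
    def __init__(self, starts, stops, content):
        self.starts, self.stops, self.content = starts, stops, content

    def __getitem__(self, i):
        return self.content[self.starts[i]:self.stops[i]]

content = [1.1, 2.2, 3.3, 4.4, 5.5]
loa = ListOffsetArray([0, 3, 3, 5], content)  # [[1.1, 2.2, 3.3], [], [4.4, 5.5]]

# Selecting rows [2, 0] from a ListOffsetArray would require rebuilding
# offsets, but a ListArray just gathers starts/stops -- no content copy:
la = ListArray([3, 0], [5, 3], content)       # [[4.4, 5.5], [1.1, 2.2, 3.3]]
```

This is why the distinction matters when manipulating list structures: the offsets form is what Arrow speaks, while the starts/stops form makes list-level reordering cheap.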
The Lookup is a set of pointers to all the buffers in the original Awkward Array, and the ArrayView represents a slice at some level of depth. The correct way to walk over the Lookup is enforced at compile time, since each node type generates the appropriate code for it.

My goal for Awkward-Numba-CUDA would be to reuse most of the infrastructure of Awkward-Numba (because it works) and replace the first level of iteration over a large array with the ability for users to write kernels on a single element of that large array. Walking over lists and structs deeper than the first level would be the same, even though it encourages users to write imperative code that might not be optimal on GPUs (e.g. users might write code with a lot of if-branches, but that would be their mistake to make).

Separating the Awkward-Numba part into a library of its own (whether CUDA-enabled or not) would be a little tricky, given how the ArrayView model was custom-written for Awkward Array types and not Arrow arrays.
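As a rough picture of the Lookup/ArrayView pattern described above, here is a pure-Python sketch (all names and fields are hypothetical; the real implementation lives in src/awkward1/_connect/_numba): the buffers are flattened into one table up front, and a small, cheap-to-copy view records where in that table the current slice lives, so pass-by-value copies stay trivially small.

```python
# Hypothetical sketch of the Lookup/ArrayView idea: one flat table of
# buffers built once, plus tiny pass-by-value views indexing into it.
from typing import NamedTuple

class Lookup(NamedTuple):
    buffers: tuple          # all buffers of the original array, flattened

class ArrayView(NamedTuple):
    lookup: Lookup
    pos: int                # which buffer holds this level's offsets/content
    start: int              # slice bounds at this level of depth
    stop: int

    def child(self, i):
        """Descend into list i, returning a new (cheap) view one level down."""
        offsets = self.lookup.buffers[self.pos]
        return ArrayView(self.lookup, self.pos + 1,
                         offsets[self.start + i], offsets[self.start + i + 1])

    def leaf(self):
        """Materialize the numeric content at this view (for illustration)."""
        return self.lookup.buffers[self.pos][self.start:self.stop]

# The jagged array [[1, 2], [], [3]] stored as offsets + content buffers:
lookup = Lookup(buffers=([0, 2, 2, 3], [1, 2, 3]))
view = ArrayView(lookup, pos=0, start=0, stop=3)
```

Note that copying an ArrayView costs four words regardless of how deep the tree is, which is the point of keeping the heavy buffer table in a single shared Lookup.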
cuDF only runs on GPUs as of now and there's no plan / roadmap for running on CPUs at this time, but what @shwina proposed here is to make the Numba pieces GPU/CPU agnostic so everyone benefits. Ideally we could live in a world where Arrow, Awkward, cuDF, etc. can all reuse the same Numba extensions.
The buffers backing cuDF columns are not Arrow views, but are our own
The goal of reusing the Numba extension for GPU/CPU is shared among us. I think the new goal we're proposing here is figuring out how to reuse the Numba extension across different projects so we can all contribute to a single place and benefit from each other's work.
Rolling our own implementation for cuDF is the fallback plan, but we have a vested interest in improving the GPU ecosystem 😄. Unfortunately, UDFs are pretty important for us to support in cuDF, and Awkward is a bit too heavy a dependency for us to take on just for UDFs.
I think this goes back to not locking ourselves into specific containers, to allow for ease of adoption, and instead using protocols and/or abstract classes to handle the Numba extensions.
As I understand it, cuDF is getting a Numba extension now, but I won't be ready to do this for Awkward Array for months. I'll use a lot of the extensions @gmarkall is adding to Numba-CUDA to implement the Awkward one. The overlap is smaller than we had thought, because cuDF's internal data model is strictly Arrow and Awkward Array's is not; as a generalization, there are additional features I'd have to implement for the Awkward one. On the other hand, I'm still very interested in interoperability projects in the future!
Greetings, awkward devs!

Over at cuDF, we are introducing a ListDtype and an associated ListColumn that is similar to awkward's jagged array, just for use with DataFrames and related operations. We're also looking to introduce other "awkward" column types in the future, such as a StructColumn analogous to Arrow's StructArray.

Something we'd like to be able to do is leverage Numba/CUDA to run user-defined functions (UDFs) on ListColumns -- pretty much exactly what is discussed here.
It seems like some redundancy between cuDF and awkward could be avoided here, by building out the required Numba extensions in a way that's easily reusable by both libraries. More importantly, it would lead to a better experience for users, as the same UDFs would run identically on both awkward arrays and cuDF.
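As a concrete (and entirely hypothetical) picture of what such a shared extension would enable, here is the CPU analogue in plain Python: the user writes a UDF that sees one list at a time, and the runtime walks the offsets on their behalf. A Numba-CUDA version would compile the same per-row UDF and map one thread (or block) to each row instead of looping.

```python
# Pure-Python stand-in for running a per-row UDF over a list column.
# In a real cuDF/Numba setup this loop would be a compiled (CUDA)
# kernel; the function name and signature here are hypothetical.

def apply_list_udf(offsets, content, udf):
    """Call udf once per row, passing that row's sublist; return one
    result per row (the common UDF shape for a list column)."""
    out = []
    for i in range(len(offsets) - 1):
        out.append(udf(content[offsets[i]:offsets[i + 1]]))
    return out

# The column [[1, 2, 3], [], [4, 5]] in offsets/content form:
offsets = [0, 3, 3, 5]
content = [1, 2, 3, 4, 5]

sums = apply_list_udf(offsets, content, sum)   # one scalar per row
lens = apply_list_udf(offsets, content, len)
```

The appeal of a shared extension is that the `udf` body above is exactly the code users would write once and run unchanged against an awkward array or a cuDF ListColumn.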
Opening this issue to hear your thoughts about this, and as a place to collect ideas on how this might be achieved. Thanks!
cc: @kkraus14 @gmarkall