RDataFrame integration #588
I add myself here, as discussed over email, to help with this project.
Argh, never replied, sorry! An RDataSource to feed awkward arrays into RDF is the correct thing to use, and to output awkward arrays a custom action can be added to RDF using the Book method (and a nice alias can be monkey-patched via Python).
I didn't know about that. The last action would also need to be non-lazy, so I guess I'll need something after it.
def AsAwkward(dataframe, column):
    awkArrMaker = AwkwardArrayMaker()
    res = dataframe.Book(awkArrMaker, column).GetValue()  # calling GetValue right away makes this eager
    # res is now some C++ struct exposed to Python via PyROOT
    # you can use it from Python or invoke a helper C++ function that
    # converts C++ arrays into awkward arrays, or whatever you want
That is helpful, thanks! I will want to do part of the translation on the C++ side, in order to make C++-native objects like
Steps 1‒2 are independent of steps 4‒5. Someone might do one, the other, or both. There's something I hadn't considered about step 4, though: I need to inspect the C++ output types of a node in the RDataFrame DAG from Python. The reflection information must be available for RDataFrame to work, but is it publicly accessible? For instance, given a Python object representing a node in an RDataFrame (ROOT.RDF.RInterface?), is it possible to get the names of defined columns and their types? How are their types represented (ROOT.TClass? if so, what about numbers and STL collections)? That's what I'd need to know to generate the appropriate C++ when an
RDF has a
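For reference, RDataFrame does report column types as C++ type strings: the example later in this thread calls data_frame.GetColumnType("x") and gets back "ROOT::VecOps::RVec<double>". A minimal sketch of parsing such template strings into a nested tree, without ROOT; parse_cpp_type is a hypothetical helper for illustration, not part of either library:

```python
def parse_cpp_type(type_str):
    """Parse a C++ template type string like 'ROOT::VecOps::RVec<double>'
    into a (name, [arguments]) tree. Hypothetical helper, for illustration."""
    type_str = type_str.strip()
    lt = type_str.find("<")
    if lt == -1:
        return (type_str, [])          # not a template: a leaf type
    assert type_str.endswith(">")
    name = type_str[:lt]
    inner = type_str[lt + 1 : -1]
    # split the template arguments on top-level commas only
    args, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(parse_cpp_type(inner[start:i]))
            start = i + 1
    args.append(parse_cpp_type(inner[start:]))
    return (name, args)

print(parse_cpp_type("ROOT::VecOps::RVec<double>"))
# ('ROOT::VecOps::RVec', [('double', [])])
```

Nested templates such as std::map<std::string, std::vector<int>> come out as nested tuples, which is enough structure to walk when generating conversion code.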
A string is probably best, anyway. I can parse templates to get all the STL structures, and if I encounter any opaque class names, I can look them up.

As for it not being fast: the type lookup, code generation, and Cling compilation are all once-per-array operations, rather than once-per-element, and I generally only worry about the performance of the latter. (I.e. we want the scaling to have a good slope, but are less concerned with the offset: defining the goal as optimizing infinitely large arrays.) Since RDF also does a one-time compilation step, this isn't very different from what you do. If I make ak.forms.Form hashable, to put in a global dict, it can be a once-per-type-of-array operation: repeatedly applying arrays of the same type could reuse previously generated and compiled interfaces.

Oh, and I should make the interface to collections be
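The once-per-type-of-array idea above can be sketched in plain Python: key a global cache on something hashable derived from the type, generate the C++ declaration only on a cache miss, and reuse it afterwards. Everything here is hypothetical and for illustration (cpp_struct_for_record, the generated struct names); the real bridge would call ROOT.gInterpreter at the indicated point:

```python
# Global cache: one generated-and-compiled interface per distinct type.
_generated_cache = {}

def cpp_struct_for_record(fields):
    """Generate (and cache) a C++ struct declaration for a record type.
    `fields` maps field names to C++ type strings; the cache key is a
    sorted tuple of items, which is hashable."""
    key = tuple(sorted(fields.items()))
    if key in _generated_cache:
        return _generated_cache[key]          # reuse: once per type, not per array
    name = "AwkwardRecord_%d" % len(_generated_cache)  # unique name per type
    lines = ["struct %s {" % name]
    for field, cpp_type in sorted(fields.items()):
        lines.append("  %s %s;" % (cpp_type, field))
    lines.append("};")
    code = "\n".join(lines)
    # in the real bridge, this is where the C++ would be handed to Cling,
    # exactly once per distinct type
    _generated_cache[key] = (name, code)
    return (name, code)

name1, code1 = cpp_struct_for_record({"pt": "double", "hits": "ROOT::VecOps::RVec<int>"})
name2, code2 = cpp_struct_for_record({"pt": "double", "hits": "ROOT::VecOps::RVec<int>"})
assert name1 == name2  # the second call reused the cached declaration
```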
If I may step in and generate a bit of noise, I'd support this on principle alone. I think many analyses are moving forward in frameworks built on uproot+awkward or RDataFrame, and I agree that having interoperability will be a big boon for analyses and analyzers. As I'm careening towards thesis writing and graduation, I'm not sure I have much bandwidth for meaningful contributions, but I'll be watching (and that may perhaps change by the time this gets attention from whoever will work on it).

I can say that I wish this existed right now: I would definitely have uses for converting from my nearly complete analysis based on RDF to awkward and back again. Interfacing with industry-standard tools and already-trained models for ML, and round-tripping back to ROOT output, without introducing even more bottlenecking intermediate stages, falling back to single-threaded execution, or rewriting things for the complete flattening I'd need using AsNumpy/MakeNumpyDataFrame, would be fantastic.

Which raises a question from me, Jim: do you envision this as a bulk operation like AsNumpy, where all columns/rows are held in memory at once, or something that can operate on chunks at a time? That may be something for much later, once a working implementation can take an entire dataset from one to the other and back.
Good to know there's interest! If it's implemented as a bulk operation, we can get the chunk-at-a-time functionality from that. The natural implementation for partitioned arrays (ak.partitioned) would be to step through the partitions, running the RDF bridge on one partition at a time. If the partitioned array is also virtual (i.e. it's a lazy array) and the RDF actions fill histograms or write to ROOT files (i.e. do not generate growing in-memory output), then such a process would never have more than one chunk of data in memory (for a suitably chosen lazy cache). That statement has a lot of qualifiers on it, but some examples of "best practices" could indicate how to do it within a memory budget.
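The partition-stepping described above can be sketched generically. Here process_in_chunks, the lambdas standing in for lazy partitions, and sum standing in for an RDF action are all hypothetical stand-ins, not library code:

```python
def process_in_chunks(partitions, consume):
    """Step through partitions one at a time: materialize a chunk, hand it
    to the consumer (in the real case, an RDF action such as a histogram
    fill or Snapshot), and release it before touching the next one."""
    results = []
    for materialize in partitions:
        chunk = materialize()            # only this chunk is in memory
        results.append(consume(chunk))
        del chunk                        # drop the reference before the next chunk
    return results

# stand-ins for three lazily materialized partitions of a partitioned array
parts = [lambda: [1.1, 2.2], lambda: [3.3], lambda: [4.4, 5.5]]
totals = process_in_chunks(parts, sum)
assert [round(t, 1) for t in totals] == [3.3, 3.3, 9.9]
```

As long as the consumer does not accumulate the chunks themselves (only reductions of them), peak memory stays at one partition, which is the property described above.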
FYI, both
Here is an example of how to use it:

    def test_data_frame_vec_of_real():
        ak_array_in = ak._v2.Array([[1.1, 2.2], [3.3], [4.4, 5.5]])

        data_frame = ak._v2.to_rdataframe({"x": ak_array_in})
        assert data_frame.GetColumnType("x") == "ROOT::VecOps::RVec<double>"

        ak_array_out = ak._v2.from_rdataframe(
            data_frame,
            column="x",
        )
        assert ak_array_in.to_list() == ak_array_out["x"].to_list()

Please let me know if there are any issues or requests. Thanks!
This issue is to collect my thoughts about how RDataFrame integration could be done. Such a thing would be useful because physicists could then mix analyses using Awkward Array, Numba, and ROOT C++ without leaving their environment. The benefits compound:
MakeRootDataFrame
and dumped into an Awkward Array. Snapshot.

Should we use Arrow? No.
In principle, we should be able to convert Awkward Arrays to and from RDataFrame using Apache Arrow, but there only seems to be an RArrowDS and not an action that converts back to Arrow (like AsNumpy), and even the RArrowDS only forwards a reference to arrow::Table; trying to include the Arrow headers from my conda-installed version of ROOT fails. Perhaps the version of ROOT in conda-forge is not linked to Arrow. If Arrow access is not consistent across ROOT installations (i.e. it's a compile-time option or something), I don't want to rely on it: it will fail for too many users.

Besides, I didn't see an example of RArrowDS for anything but primitive types: if a jagged array is passed from Arrow to RDataFrame, how does it appear to a physicist writing code for a Define action? Like strange arrow:: stuff? Taking this and the above consideration together, I don't think Arrow is the right route for Awkward → RDataFrame, despite first appearances.

Code generation for Awkward → RDataFrame
But code generation could be a good way to go. Each distinct (different Form) attempt to create an RDataFrame from an Awkward Array (probably named ak.to_rdataframe) could try to import ROOT (no explicit dependence), create a string of C++ code, and run it through ROOT.gInterpreter.Define (with unique names, to permit repeated definitions). Any nested records could be declared as new structs so that nested fields can be accessed as record.fieldName (as long as field names are in [A-Za-z_][A-Za-z_0-9]*). Any nested lists could be emplaced into ROOT::VecOps::RVec. There's a performance penalty for converting columnar data into record-oriented data like this, but most physics use-cases don't run at such high speeds that this would be the bottleneck.

The code to generate would look a lot like this (shamelessly stolen from RTrivialDS with "Trivial" changed to "Wonky"):

RDataFrame → Awkward
For the other direction, creating an Awkward Array from RDataFrame, it would perhaps be most natural to import the ArrayBuilder functions into the ROOT C++ context through function pointers, exactly as they have been imported into the Numba context. They could then be used by users in an RDataFrame Foreach action (which is decidedly non-functional, but that's how it's done in Numba), or we could monkey-patch an AsAwkward action onto RInterface that generates the arrays of the Awkward layout as columns and provides them to Python through AsNumpy. It can be exactly the format required by ak.from_arrayset.

It seems that monkey-patching is indeed possible:
Even for objects made before the monkey-patching:
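Monkey-patching works even for pre-existing instances because Python resolves methods through the class at call time, not at construction time. A minimal sketch with a stand-in class (RInterfaceLike and its method are hypothetical, not the real ROOT.RDF.RInterface):

```python
class RInterfaceLike:
    """Stand-in for a PyROOT-exposed class, for illustration only."""
    def __init__(self, columns):
        self.columns = columns

node = RInterfaceLike(["x", "y"])   # instance created *before* the patch

def AsAwkward(self, column):
    # the real version would Book an action and build an Awkward Array;
    # here we just return a marker string
    return "awkward(%s)" % column

RInterfaceLike.AsAwkward = AsAwkward  # monkey-patch the class

# the pre-existing instance picks up the new method via class attribute lookup
assert node.AsAwkward("x") == "awkward(x)"
```

This is the property that lets an Awkward-side import add AsAwkward to every RInterface node, including ones the user already built.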
So this allows us to fully decouple installation requirements (ROOT does not depend on Awkward Array, and Awkward Array does not depend on ROOT) while allowing us to add an AsAwkward action to every RInterface in PyROOT. The downside is that it has to be installed by some Awkward import, such as

Without this import, the RDataFrame nodes would not have an AsAwkward action. Numba avoids this by checking for entry points in setuptools, but doing that here would require a change to ROOT. If this project goes well and people find it useful, then adding an entry-points mechanism to PyROOT would be better motivated.

People who might be interested
@eguiraud and @etejedor, probably! The above project is speculative at this point and no one has asked for it. I just got to thinking that such an interface should exist, since it would multiply the possibilities available to physicists doing analysis.