toarrow and fromarrow #68

Closed
jpivarski opened this issue Jan 14, 2020 · 5 comments · Fixed by #263
Assignees
Labels
feature New feature or request

Comments

@jpivarski
Member

jpivarski commented Jan 14, 2020

Similar to the Awkward ↔ Arrow conversions in Awkward0, except in C++, rather than Python.

It's a recursive if-elseif-elseif-...-else chain down the list of array node types, replacing each from one library with its equivalent in the other. Conversions from Arrow → Awkward can be zero-copy, now that BitMaskedArray exists, and conversions from Awkward → Arrow would involve one copy (to move the disparate buffers into Arrow's single buffer format).
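The recursive chain described above might look roughly like the following sketch. The node classes and the dict-based "Arrow-like" output are hypothetical stand-ins invented for illustration; they are not the real Awkward or Arrow types, and the real converter would need many more branches.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Hypothetical stand-ins for array node types (NOT the real Awkward classes).
@dataclass
class NumpyNode:
    data: List[float]


@dataclass
class ListOffsetNode:
    offsets: List[int]
    content: Any


@dataclass
class RecordNode:
    fields: Dict[str, Any]


def to_arrow_like(node):
    """Recursively replace each node with an Arrow-style description.

    The result is a plain dict (an assumption made for this sketch).
    A validity buffer of None means "no nulls".
    """
    if isinstance(node, NumpyNode):
        # Primitive layout: [validity, data]
        return {"type": "primitive", "buffers": [None, node.data], "children": []}
    elif isinstance(node, ListOffsetNode):
        # List layout: [validity, offsets] plus one child for the content
        return {
            "type": "list",
            "buffers": [None, node.offsets],
            "children": [to_arrow_like(node.content)],
        }
    elif isinstance(node, RecordNode):
        # Struct layout: [validity] plus one child per field
        return {
            "type": "struct",
            "buffers": [None],
            "names": list(node.fields),
            "children": [to_arrow_like(c) for c in node.fields.values()],
        }
    else:
        raise TypeError(f"unhandled node type: {type(node).__name__}")
```

For example, `to_arrow_like(ListOffsetNode([0, 2, 3], NumpyNode([1.1, 2.2, 3.3])))` yields a "list" description whose single child is a "primitive" description. The reverse direction would be the same kind of chain with the branches inverted.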

@jpivarski jpivarski self-assigned this Jan 14, 2020
@jpivarski jpivarski added the feature New feature or request label Jan 14, 2020
@jpivarski jpivarski added this to the Needed for Arrow/Parquet milestone Jan 14, 2020
@jpivarski jpivarski changed the title toarrow and fromarrow in C++ toarrow and fromarrow Mar 4, 2020
@jpivarski
Member Author

Most expected users of Awkward ↔ Arrow conversion are using Python, not C++. Writing a converter in C++ is not a performance consideration, but an accessibility one—it allows pure C++ programs to exchange structured data with other programs in the Arrow ecosystem. However, writing it would mean spinning off a separate package, since Awkward can't take on Arrow as a dependency; the Awkward-Arrow package would have to depend on both Awkward and Arrow, which is too much to deal with right now.

For the time being, Awkward ↔ Arrow conversion will be a Python function. (That Python function can continue to exist as a fallback if the Awkward-Arrow package isn't available.)

@jpivarski
Member Author

On second thought, there's this: https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst

We may have a minimal-dependency way to consume and produce Arrow buffers after all. (Need to check on the status of that from Arrow.)

@ianna
Collaborator

ianna commented Mar 13, 2020

It looks promising.

@jpivarski
Member Author

My reading of this (Arrow JIRA ticket and pull request) is that this human-readable specification is the entirety of the C interface. There's no code other than what we see on the instructions page. We're supposed to copy its struct definitions into our project, populate them according to the rules on the page, and that's it: the in-memory buffer we've just made is an Arrow buffer. It would be nice to see an example of wrapping that buffer in pyarrow and verifying that the data can be round-tripped, but once we (I or someone else) figure out how to do that, we can submit such an example as a PR to apache/arrow.
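As a rough illustration of "copy the struct definitions and populate them," here is a minimal sketch that mirrors the two C structs from the specification with Python's `ctypes` and fills them in for a non-nullable int32 array. The helper names (`export_int32`, `read_int32`, `release_schema`, `release_array`) are made up for this example, `metadata` is typed as an opaque pointer because it is a binary blob, and nothing here involves pyarrow; a C++ version would populate the same fields directly.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


class ArrowArray(ctypes.Structure):
    pass


SchemaRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Field order follows the struct definitions in the C data interface text.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_void_p),  # binary blob, so kept as an opaque pointer
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", SchemaRelease),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease),
    ("private_data", ctypes.c_void_p),
]


@SchemaRelease
def release_schema(ptr):
    # Python owns the memory in this sketch, so only mark the struct released.
    ptr.contents.release = SchemaRelease()


@ArrayRelease
def release_array(ptr):
    ptr.contents.release = ArrayRelease()


def export_int32(values):
    """Describe a list of ints as a non-nullable Arrow int32 array (hypothetical helper)."""
    n = len(values)
    data = (ctypes.c_int32 * n)(*values)
    # Primitive layout: buffers[0] = validity bitmap (NULL, no nulls), buffers[1] = data.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))

    schema = ArrowSchema()
    schema.format = b"i"  # format string for int32
    schema.name = b""
    schema.release = release_schema  # remaining fields default to 0 / NULL

    array = ArrowArray()
    array.length = n
    array.null_count = 0
    array.offset = 0
    array.n_buffers = 2
    array.n_children = 0
    array.buffers = ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p))
    array.release = release_array

    # The caller must hold on to `keepalive` for as long as the structs are in use.
    keepalive = (data, buffers)
    return schema, array, keepalive


def read_int32(schema, array):
    """Read values back through the struct pointers only (consumer side)."""
    if schema.format != b"i":
        raise ValueError("this sketch only handles int32")
    ptr = ctypes.cast(array.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [ptr[array.offset + i] for i in range(array.length)]
```

Calling `schema, array, keepalive = export_int32([1, 2, 3])` and then `read_int32(schema, array)` reads the values back through the struct fields alone. A consumer that is finished calls `array.release(ctypes.byref(array))`, which in this sketch only sets the `release` member to NULL. Round-tripping through pyarrow, as wished for above, is not attempted here.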

@jpivarski
Member Author

I can't make it an "assignment," but @trickarcher is actively working on this.

@jpivarski jpivarski linked a pull request Apr 26, 2020 that will close this issue
@trickarcher trickarcher self-assigned this May 10, 2020
@jpivarski jpivarski linked a pull request May 14, 2020 that will close this issue