feat: 2772 pyarrow doesnt permit selective reading with extensionarray #3127
Conversation
Each object now includes "field_name". Each composite object, including the table itself, is represented as a list of objects. This feels more consistent and robust.
This does not work yet: `Table.cast()` seems to fail me here. But the schema generation appears to work so far.
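For illustration only, a hypothetical rendering of that structure; every key other than `"field_name"` is made up for this sketch, and the real layout in this PR may differ:

```python
# Hypothetical sketch: every object carries "field_name", and composite
# objects (including the table itself) are lists of such objects.
table_level_metadata = [
    {"field_name": "x", "parameters": {}},
    {
        "field_name": "y",  # a composite (e.g. list-of-struct) column
        "parameters": {},
        "children": [
            {"field_name": "item", "parameters": {}},
        ],
    },
]
```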
* Replacing `Table.cast()` with a custom function, `replace_schema()`
* Replacing the dictionary of lambdas with a function using an `elif isinstance` chain (sketched below)
* This is currently very poorly tested
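For illustration, a minimal sketch of what such an `elif isinstance` dispatch can look like, using public pyarrow type constructors; the real `replace_schema()` logic handles more cases and does more than rebuild types:

```python
import pyarrow as pa

# Recursively rebuild a pyarrow type via an elif-isinstance chain.
# This is a sketch, not the PR's actual conversion logic.
def convert_type(typ: pa.DataType) -> pa.DataType:
    if isinstance(typ, pa.DictionaryType):
        return pa.dictionary(typ.index_type, convert_type(typ.value_type), typ.ordered)
    elif isinstance(typ, pa.StructType):
        return pa.struct([field.with_type(convert_type(field.type)) for field in typ])
    elif isinstance(typ, pa.ListType):
        return pa.list_(convert_type(typ.value_type))
    elif isinstance(typ, pa.LargeListType):
        return pa.large_list(convert_type(typ.value_type))
    else:
        return typ  # leaf types pass through unchanged in this sketch
```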
All the pyarrow types that we use are now handled, but there are still errors converting native `DictionaryType` arrays to awkward extension types.
Turns out you need `AwkwardArrowArray.from_storage` to create extension-type dictionary arrays. Strange, but I'm evidently not the first poor soul to bump into this.
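For context, here is the `from_storage` pattern in general pyarrow terms, with a throwaway extension type standing in for awkward's; the class and names are illustrative:

```python
import pyarrow as pa

# A throwaway extension type; awkward's AwkwardArrowType uses the same
# pyarrow extension mechanism but carries real serialized metadata.
class DemoExtType(pa.ExtensionType):
    def __init__(self, storage_type):
        super().__init__(storage_type, "demo.ext")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(storage_type)

storage = pa.array(["a", "b", "a"]).dictionary_encode()  # a DictionaryArray
ext_type = DemoExtType(storage.type)

# As noted above, from_storage is the way to wrap dictionary-encoded storage
# in an extension type (a plain cast does not do the job here).
ext_array = pa.ExtensionArray.from_storage(ext_type, storage)
```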
The unit tests do not yet cover Parquet file operations; that coverage will likely come in the next commit.
Also expanded test_2772 a bit, trying to reproduce errors from test_1440. But instead of reproducing these, I found new errors. Ugh. Checking this in because it's just where I'm at right now.
Added a new test for actually doing a selective read. This required changing the top-level metadata from a list to a JSON object. Also fixed a bug: when converting a table to native, keep any existing table metadata entries.
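A sketch of that bug fix in isolation, merging rather than overwriting schema metadata; the key name and helper are hypothetical, not the PR's actual code:

```python
import json
import pyarrow as pa

# Attach collected awkward metadata without discarding entries the table
# already has. "awkward_metadata" is a placeholder key for this sketch.
def with_awkward_metadata(table: pa.Table, collected: dict) -> pa.Table:
    metadata = dict(table.schema.metadata or {})  # keep existing entries
    metadata[b"awkward_metadata"] = json.dumps(collected).encode()
    return table.replace_schema_metadata(metadata)
```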
Super excited to see this work @tcawlfield!
It's looking good! The next step would be to make `ak.to_arrow_table` and `ak.from_arrow_table` (and, by extension, `ak.to_parquet` and `ak.from_parquet`) do this automatically. Then a lot of the existing tests in
- tests/test_1125_to_arrow_from_arrow.py
- tests/test_1154_arrow_tables_should_preserve_parameters.py
- tests/test_1294_to_and_from_parquet.py
- tests/test_1453_write_single_records_to_parquet.py
- tests/test_2340_unknown_type_to_arrow_and_back.py
- tests/test_2772_parquet_extn_array_metadata.py
would verify that the metadata is being fully preserved and arrays are being properly reconstructed.
I tried this:

```python
>>> arrow_table = ak.to_arrow_table(  # tuples need metadata
...     ak.Array([[(1, 1.1), (2, 2.2)], [], [(3, 3.3)]]),
...     extensionarray=True,
... )
>>> ak._connect.pyarrow_table_conv.convert_awkward_arrow_table_to_native(
...     arrow_table
... )
```

because this function takes a `pa.Table` and returns a `pa.Table`, but it didn't work:
```pytb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 37, in convert_awkward_arrow_table_to_native
    return replace_schema(aatable, new_schema)
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 182, in replace_schema
    for col, new_field in zip(batch.itercolumns(), new_schema):
AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'itercolumns'. Did you mean: 'num_columns'?
```
(I wanted to see what the collected metadata looks like.) Maybe I'm pulling the wrong function, though. (This whole new submodule consists of internal functions, so they don't have to have an obvious way to use them. They can be eclectic.)
This is looking good, but the real test will be when it's applied to all Awkward ↔ Arrow Table conversions, which will verify that the full conversions are lossless and correct.
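A sketch of the kind of round-trip check implied here; comparing `to_list()` output is a loose equality test, and the actual tests would presumably also compare forms and parameters:

```python
import awkward as ak

# Convert to an Arrow table and back, then compare values.
original = ak.Array([[(1, 1.1), (2, 2.2)], [], [(3, 3.3)]])
table = ak.to_arrow_table(original, extensionarray=True)
roundtrip = ak.from_arrow(table)
assert original.to_list() == roundtrip.to_list()
```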
Fixes failures in test 2968
Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>
This should have worked. It works for me. Perhaps you found the first issue with respect to differing pyarrow versions? Or maybe I had failed to push local commits yesterday? I pushed some commits just now -- curious what this does for you now.
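If it is a version difference, one hedged guess at a workaround for the `AttributeError` above, assuming older pyarrow `RecordBatch` objects expose `.columns` but not `.itercolumns()`:

```python
import pyarrow as pa

# Iterate a RecordBatch's columns in a version-tolerant way; this helper is
# a sketch, not necessarily how the PR resolves it.
def iter_batch_columns(batch: pa.RecordBatch):
    if hasattr(batch, "itercolumns"):
        return batch.itercolumns()
    return iter(batch.columns)  # fallback for older pyarrow (e.g. 7.0.0)
```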
This makes things more natural to use from outside, with or without pyarrow installed.
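This presumably relies on the usual optional-dependency pattern; a generic sketch (awkward's actual import guard lives in its `_connect` layer and may differ):

```python
# Import pyarrow if available, deferring the failure until a feature that
# actually needs it is used.
try:
    import pyarrow
except ImportError:
    pyarrow = None

def require_pyarrow():
    if pyarrow is None:
        raise ImportError("this feature requires pyarrow to be installed")
    return pyarrow
```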
I'm marking this as ready for review, even though there are still failures on Python 3.8. I'll look into what it takes to reproduce this, but so far I'm hitting errors like … that come specifically from here.
I'm having trouble installing all of requirements-test-minimal.txt in a Python 3.11 environment, although I got pyarrow 7.0.0 installed. And indeed `LargeListArray` in that version has a … You'll quickly notice that I finally broke down and converted the pyarrow connection code into a directory.
Actually, I remember that issue with `LargeListArray`. I might have even reported it, but it's probably from back when Arrow used JIRA instead of GitHub, and I'd have trouble finding it. Special-casing with …
Is the difficulty of making an environment with pyarrow 7 related to conda versus pip venv (because pyarrow has a binary dependency)? Making environments with the minimal and maximal dependencies works with conda (mamba).
Reorganizing all of the pyarrow connections into a directory is a good idea. Thanks!
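Special-casing on the pyarrow version can be done by parsing `pyarrow.__version__`; a sketch with a placeholder cutoff, not necessarily the version the PR actually keys on:

```python
import pyarrow

# Parse the major/minor version for coarse feature gating.
pyarrow_version = tuple(int(v) for v in pyarrow.__version__.split(".")[:2])

if pyarrow_version >= (8, 0):  # hypothetical cutoff for this sketch
    ...  # take the newer-pyarrow code path
else:
    ...  # compatibility path for older releases such as 7.0.0
```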
Moving `to_awkwardarrow_storage_types` from `.conversions` to `.extn_types`.
Okay, I think this is all ready for review now, with tests passing!
My issue with pyarrow 7.0.0 is resolved. I'm not sure why I had trouble installing the older requirements earlier. It was a compilation error of some sort, but I was able to grab old pyarrow with the latest versions of the other dependencies in our requirements-test-minimal.txt and work out all the changes needed.
I left some earlier code commented out, with the rationale that we might in the future drop support for older pyarrow versions and can then clean up and/or improve the code. Some people don't like checking in commented-out code: "That's what git is for!" And this is true, but I have things commented out that might be worth reviewing in the future, or that make intentions more clear. I do try to keep this to a bare minimum, though.
This is looking great! I think this is ready to merge, if updating to `main` is uneventful (i.e., it merges cleanly and all tests continue to pass).
The failing test is being moved to a new file, to be added later.
Adding a module, `awkward._connect.pyarrow_table_conv`, to convert pyarrow tables with extensionarray types to ones without, storing the metadata necessary to reconstruct the extensionarray types in the table-level metadata. This allows reading select columns, since tables in newly written Parquet files will have native column types together with the metadata needed to convert back again on read.
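An end-to-end sketch of what that enables; the file path and data are illustrative:

```python
import awkward as ak

# Write with extension arrays: columns land in the Parquet file as native
# types, with reconstruction metadata stored at the table level.
ak.to_parquet(
    ak.Array([{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.2, 3.3]}]),
    "/tmp/example.parquet",
    extensionarray=True,
)

# Selectively reading one column now works, and the awkward extension types
# are reconstructed from the stored metadata.
only_x = ak.from_parquet("/tmp/example.parquet", columns=["x"])
```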