
feat: 2772 pyarrow doesnt permit selective reading with extensionarray #3127

Merged

Conversation

tcawlfield (Collaborator)

This adds a module, awkward._connect.pyarrow_table_conv, that converts pyarrow tables with ExtensionArray column types into tables without them, storing the metadata needed to reconstruct the ExtensionArray types in the table-level metadata.

This allows reading selected columns, since tables in newly written Parquet files will have native column types together with the metadata needed to convert them back on read.
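A minimal sketch of the intended round trip (assuming the ak.to_parquet / ak.from_parquet integration discussed later in this thread; the file name is illustrative):

import awkward as ak

# Hypothetical end-to-end illustration: a Parquet file written this way
# carries native column types plus table-level metadata, so a selective
# (single-column) read can still reconstruct the Awkward types.
ak.to_parquet(ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}]), "demo.parquet")
partial = ak.from_parquet("demo.parquet", columns=["x"])  # reads only column "x"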

* Each object now includes "field_name". Each composite object, including the table itself, is represented as a list of objects. This feels more consistent and robust.
* This does not work yet: Table.cast() seems to fail me here, but the schema generation appears to work so far.
* Replacing Table.cast() with a custom function, replace_schema().
* Replacing the dictionary of lambdas with a function using an elif isinstance chain. This is currently very poorly tested.
* Handling all the pyarrow types that we use now, but there are still errors converting native DictionaryType arrays to awkward extension types. It turns out you need AwkwardArrowArray.from_storage to create extension-type dictionary arrays. Strange, but I'm evidently not the first poor soul to bump into this.
* The unit tests do not yet cover Parquet file operations; those will likely be in the next commit.
* Also expanded test_2772 a bit, trying to reproduce errors from test_1440. Instead of reproducing those, I found new errors. Ugh. Checking this in because it's just where I'm at right now.
* Added a new test that actually performs a selective read. This required changing the top-level metadata from a list to a JSON object.
* Also fixed a bug: when converting a table to native, keep any existing table metadata entries (a sketch of this fix follows the list).
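The metadata-preservation fix could look roughly like this (a hedged sketch; with_merged_metadata and awkward_meta are illustrative names, not the PR's actual code):

import pyarrow as pa

def with_merged_metadata(table: pa.Table, awkward_meta: dict) -> pa.Table:
    # Keep whatever metadata the table already carries and layer the
    # awkward reconstruction info on top of it.
    merged = dict(table.schema.metadata or {})
    merged.update(awkward_meta)
    return table.replace_schema_metadata(merged)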
@agoose77 (Collaborator)

Super excited to see this work @tcawlfield!

@jpivarski (Member) left a comment:

It's looking good! The next step would be to make ak.to_arrow_table and ak.from_arrow_table (and, by extension, ak.to_parquet and ak.from_parquet) do this automatically. Then a lot of the existing tests in

  • tests/test_1125_to_arrow_from_arrow.py
  • tests/test_1154_arrow_tables_should_preserve_parameters.py
  • tests/test_1294_to_and_from_parquet.py
  • tests/test_1453_write_single_records_to_parquet.py
  • tests/test_2340_unknown_type_to_arrow_and_back.py
  • tests/test_2772_parquet_extn_array_metadata.py

would verify that the metadata is being fully preserved and arrays are being properly reconstructed.

I tried this:

>>> arrow_table = ak.to_arrow_table(   # tuples need metadata
...     ak.Array([[(1, 1.1), (2, 2.2)], [], [(3, 3.3)]]),
...     extensionarray=True,
... )
>>> ak._connect.pyarrow_table_conv.convert_awkward_arrow_table_to_native(
...     arrow_table
... )

because this function takes a pa.Table and returns a pa.Table, but it didn't work:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 37, in convert_awkward_arrow_table_to_native
    return replace_schema(aatable, new_schema)
  File "/Users/jpivarski/irishep/awkward/src/awkward/_connect/pyarrow_table_conv.py", line 182, in replace_schema
    for col, new_field in zip(batch.itercolumns(), new_schema):
AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'itercolumns'. Did you mean: 'num_columns'?

(I wanted to see what the collected metadata looks like.) Maybe I'm pulling the wrong function, though. (This whole new submodule consists of internal functions, so they don't have to have an obvious way to use them. They can be eclectic.)

This is looking good, but the real test will be when it's applied to all Awkward ↔ Arrow Table conversions, which will verify that the full conversions are lossless and correct.
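If the failure turns out to be a pyarrow version difference (as the reply below suggests), a version-tolerant iteration could look like this hedged sketch (iter_batch_columns is an illustrative name, not the PR's code):

def iter_batch_columns(batch):
    # Older pyarrow RecordBatch objects lack itercolumns(); fall back
    # to positional indexing, which works across pyarrow versions.
    if hasattr(batch, "itercolumns"):
        return batch.itercolumns()
    return (batch.column(i) for i in range(batch.num_columns))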

tcawlfield and others added 2 commits May 23, 2024 10:03
Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>
@tcawlfield (Collaborator, Author)

> It's looking good! The next step would be to make ak.to_arrow_table and ak.from_arrow_table (and, by extension, ak.to_parquet and ak.from_parquet) do this automatically. […] I tried this […] because this function takes a pa.Table and returns a pa.Table, but it didn't work: AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'itercolumns'.

This should have worked; it works for me. Perhaps you found the first issue with respect to differing pyarrow versions? Or maybe I had failed to push local commits yesterday? I pushed some commits just now -- curious what this does for you now.

@tcawlfield tcawlfield marked this pull request as ready for review May 23, 2024 22:00
@tcawlfield (Collaborator, Author)

I'm marking this as ready for review, even though there are still failures on Python 3.8. I'll look into what it takes to reproduce this, but so far I'm hitting errors like:

AttributeError("'pyarrow.lib.LargeListType' object has no attribute 'field'")

that come specifically from here.

I'm having trouble installing all of requirements-test-minimal.txt in a Python 3.11 environment, although I got pyarrow 7.0.0 installed. And indeed LargeListArray in that version has a value_field property but not a field() method. I can probably special-case that with hasattr, but without reproducing the whole environment I'm not sure what would fail next. So I'll pause here and wait for any advice you might offer, and in the meantime try to get this working with an old Python somewhere.
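That hasattr special case might look something like this sketch (list_value_field is an illustrative helper, not the PR's actual code):

import pyarrow as pa

def list_value_field(list_type):
    # Newer pyarrow exposes field(0) on (Large)ListType; pyarrow 7.0.0
    # only provides the value_field property.
    if hasattr(list_type, "field"):
        return list_type.field(0)
    return list_type.value_field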

You'll quickly notice that I finally broke down and converted ._connect.pyarrow into a package, with my new code as a submodule, .table_conv. I did a very minimal restructuring of the rest of pyarrow.py: the top-level __init__.py just handles safe importing with/without pyarrow and creates the same namespace of symbols regardless, so that other code can import anything here without having to get too fancy. The extension classes went into a separate submodule because that part was easy to chop off and it felt reasonable to do so. Everything else went into a grab-bag, .../pyarrow/conversions.py. (I'm happy to rename this.)

I felt this was a safe place to draw the line between minor and major rearranging. My primary motivation was to pull the new table-conversion functions, convert_awkward_arrow_table_to_native and convert_native_arrow_table_to_awkward, up into ..._connect.pyarrow in a way that kept the new code separate, avoided circular imports, and worked well with or without pyarrow.
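A rough sketch of that safe-import pattern (the submodule names match the discussion, but the exact contents are assumptions; the real __init__.py builds the same namespace even without pyarrow, which this simplified version does not attempt):

# src/awkward/_connect/pyarrow/__init__.py (sketch)
try:
    import pyarrow
except ImportError:
    pyarrow = None  # downstream code can check for None

if pyarrow is not None:
    from .extn_types import *   # noqa: F401,F403
    from .conversions import *  # noqa: F401,F403
    from .table_conv import (   # noqa: F401
        convert_awkward_arrow_table_to_native,
        convert_native_arrow_table_to_awkward,
    )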

@jpivarski (Member)

> And indeed LargeListArray in that version has a value_field property but not a field() method. I can probably special-case that with hasattr, but without reproducing the whole environment I'm not sure what would fail next.

Actually, I remember that issue with LargeListArray. I might have even reported it, but it's probably from back when Arrow used JIRA instead of GitHub and I'd have trouble finding it. Special-casing with hasattr is definitely fine, and it's an isolated thing. There's no reason to expect a deluge of issues afterward.

Is the difficulty of making an environment with pyarrow 7 related to conda versus pip venv (because pyarrow has a binary dependency)? Making environments with the minimal and maximal dependencies works with conda (mamba).

Reorganizing all of the pyarrow connections into a directory is a good idea. Thanks!

Moving to_awkwardarrow_storage_types from .conversions to .extn_types.
@tcawlfield (Collaborator, Author)

Okay, I think this is all ready for review now, with tests passing!

> Is the difficulty of making an environment with pyarrow 7 related to conda versus pip venv (because pyarrow has a binary dependency)? Making environments with the minimal and maximal dependencies works with conda (mamba).

My issue with pyarrow 7.0.0 is resolved. I'm not sure why I had trouble installing the older requirements earlier; it was a compilation error of some sort. I was able to install the old pyarrow with the latest versions of the other dependencies in requirements-test-minimal.txt and work out all the changes needed. I left some earlier code commented out, with the rationale that we might in the future drop support for older pyarrow versions and can then clean up and/or improve the code. Some people don't like checking in commented-out code ("That's what git is for!"), and that's true, but I have things commented out that might be worth reviewing in the future, or that make intentions clearer. I do try to keep this to a bare minimum, though.

@jpivarski (Member) left a comment:

This is looking great! I think this is ready to merge, if updating to main is uneventful (i.e. it merges cleanly and all tests continue to pass).

The failing test is being moved to a new file, to be added later.
@tcawlfield tcawlfield merged commit ef2e08f into main May 27, 2024
41 checks passed
@tcawlfield tcawlfield deleted the 2772-pyarrow-doesnt-permit-selective-reading-with-extensionarray branch May 27, 2024 19:10
Successfully merging this pull request closes issue #2772: PyArrow doesn't permit selective reading with ExtensionArray.