Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support conversion to arrow and back with non-option Unknown type #3085

Merged

Conversation

tcawlfield
Copy link
Collaborator

When we convert an Awkward array of type Unknown, this becomes
an arrow array (even prior to this) with zero length, nullable-null type,
but with Awkward metadata indicating no mask (non-option-type).
I believe in this case we can be sure that the original Awkward array was an EmptyArray.

@tcawlfield tcawlfield linked an issue Apr 19, 2024 that may be closed by this pull request
The unit tests required checking for nullable sub-arrays, as only these would be valid for Parquet etc.
To properly disambiguate the nullable Arrow arrays, we require adding more metadata.
In this case, I'm adding "is_empty_array" to node_parameters.

I'm not yet confident that all code paths are tested and are correct yet. Other enclosing array types likely need support.
* EmptyArray no longer abuses node_parameters but instead gets its own ArrowArrayType property,
_is_nonnullable_nulltype.
On conversion to arrow, instead of <type>.is_option we test <type>._arrow_needs_option_type()
This change seems to be about code consistency only. I did not find any way to test this change.
A table column must be a list type capable of having a non-zero number of columns.
An EmptyArray column type doesn't make any sense.
Added unit test for this. And ammending previous commit with a similar unit test for ak_to_arrow_array!
This intentionally fails with this commit.
Also dressing up RecordArray tests a little just for fun.
@tcawlfield tcawlfield self-assigned this Apr 24, 2024
It might be better to test pyarrow version instead of hasattr(...) though.
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only really important issue is on emptyarray.py new line 376: you're setting is_nonnullable_nulltype=True for every EmptyArray.

If it's originally a non-optional unknowntype, you want to wrap it in UnmaskedArray and set is_nonnullable_nulltype=True so that when it gets read back from Arrow, the fake UnmaskedArray gets taken off again.

If it's originally a ?unknowntype (the ? is option-type), you want to not wrap it in UnmaskedArray (it's already in an option-tyep) and set is_nonnullable_nulltype=True so that when it gets read back from Arrow, nothing gets taken off again.

To ensure that this behavior gets tested, you should have some originally ?unknowntype examples. You're currently testing

  • [] of unknowntype
  • [[], [], []] of var * unknowntype

Add to that

  • [] of ?unknowntype: you can make it with ak.Array([None])[0:0]
  • [[], [], []] of var * ?unknowntype: you can make it with ak.Array([[], [], [], [None]])[0:3].

The rest of my comments are minor coding style preferences.

src/awkward/_connect/pyarrow.py Outdated Show resolved Hide resolved
src/awkward/_connect/pyarrow.py Outdated Show resolved Hide resolved
src/awkward/_connect/pyarrow.py Outdated Show resolved Hide resolved
src/awkward/contents/emptyarray.py Outdated Show resolved Hide resolved
src/awkward/contents/emptyarray.py Outdated Show resolved Hide resolved
src/awkward/contents/emptyarray.py Outdated Show resolved Hide resolved
src/awkward/contents/emptyarray.py Outdated Show resolved Hide resolved
src/awkward/_meta/meta.py Outdated Show resolved Hide resolved
src/awkward/contents/emptyarray.py Outdated Show resolved Hide resolved
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done! I checked the code carefully and I tested a few manually constructed arrays:

>>> import numpy as np
>>> import awkward as ak
>>> from awkward.contents import EmptyArray, UnmaskedArray, ListOffsetArray
>>> from awkward.index import Index64
>>>
>>> ak.from_arrow(ak.to_arrow(EmptyArray(), extensionarray=False), highlevel=False)
<IndexedOptionArray len='0'>
    <index><Index dtype='int64' len='0'>[]</Index></index>
    <content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(EmptyArray(), extensionarray=True), highlevel=False)
<EmptyArray len='0'/>

That's good: with extensionarray=False, it can't round-trip: the EmptyArray comes back as an option-type (IndexedOptionArray) EmptyArray, but that's because it has nowhere to store the metadata. Meanwhile, the extensionarray=True case does round-trip: it's using that metadata.

Same thing inside of a list:

>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), EmptyArray()), extensionarray=False), highlevel=False)
<ListOffsetArray len='0'>
    <offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
    <content><IndexedOptionArray len='0'>
        <index><Index dtype='int64' len='0'>
            []
        </Index></index>
        <content><EmptyArray len='0'/></content>
    </IndexedOptionArray></content>
</ListOffsetArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), EmptyArray()), extensionarray=True), highlevel=False)
<ListOffsetArray len='0'>
    <offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
    <content><EmptyArray len='0'/></content>
</ListOffsetArray>

Without an extensionarray, it can't store the metadata to know that the EmptyArray originally wasn't wrapped in any option-type node, but with an extensionarray, it knows that and reproduces the original type.

Now if we wrap the EmptyArray in an option-type (UnmaskedArray), it always comes back with the option-type, regardless of the extensionarray setting.

>>> ak.from_arrow(ak.to_arrow(UnmaskedArray(EmptyArray()), extensionarray=False), highlevel=False)
<IndexedOptionArray len='0'>
    <index><Index dtype='int64' len='0'>[]</Index></index>
    <content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(UnmaskedArray(EmptyArray()), extensionarray=True), highlevel=False)
<IndexedOptionArray len='0'>
    <index><Index dtype='int64' len='0'>[]</Index></index>
    <content><EmptyArray len='0'/></content>
</IndexedOptionArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), UnmaskedArray(EmptyArray())), extensionarray=False), highlevel=False)
<ListOffsetArray len='0'>
    <offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
    <content><IndexedOptionArray len='0'>
        <index><Index dtype='int64' len='0'>
            []
        </Index></index>
        <content><EmptyArray len='0'/></content>
    </IndexedOptionArray></content>
</ListOffsetArray>
>>>
>>> ak.from_arrow(ak.to_arrow(ListOffsetArray(Index64(np.array([0])), UnmaskedArray(EmptyArray())), extensionarray=True), highlevel=False)
<ListOffsetArray len='0'>
    <offsets><Index dtype='int64' len='1'>[0]</Index></offsets>
    <content><IndexedOptionArray len='0'>
        <index><Index dtype='int64' len='0'>
            []
        </Index></index>
        <content><EmptyArray len='0'/></content>
    </IndexedOptionArray></content>
</ListOffsetArray>

The unit tests test these cases as well, but not explicitly with UnmaskedArray. (Constructing it from Python lists makes the option-type with IndexedOptionArray, but the code isn't hard-coded to any particular node type, just the option-ness.)

So I think this is ready to merge!

@tcawlfield tcawlfield merged commit 21af5dc into main Apr 29, 2024
40 of 41 checks passed
@tcawlfield tcawlfield deleted the 2340-awkward-allows-non-nullable-unknown-type-but-arrow-doesnt branch April 29, 2024 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Awkward allows non-nullable unknown type, but Arrow doesn't
2 participants