Support for pandas NAType #2931

xinyuejohn · 2024-01-10T16:06:44Z

Description of new feature

Hi, I was creating an Awkward Array from a pandas DataFrame and found that Awkward doesn't support pandas._libs.missing.NAType, the type of pandas.NA.

It would be great if this NAType could be supported.

To replicate:

import awkward as ak
import pandas

a = ["a", "b", pandas._libs.missing.NAType()]
ak.Array(a)

Traceback:

ValueError: cannot convert <NA> (type NAType) to an array element

(https://github.com/scikit-hep/awkward/blob/awkward-cpp-26/awkward-cpp/src/python/content.cpp#L191)

This error occurred while calling

    ak.to_layout(
        ['a', 'b', <NA>]
        allow_record = False
        regulararray = False
        primitive_policy = 'error'
    )


ianna · 2024-01-10T19:41:06Z

@xinyuejohn - thanks for reporting the issue! Looking at the pandas documentation for the NA type, I think NAType should be converted to Python None.
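
Until such a conversion exists in the library, it can be approximated on the user side before calling ak.Array. The following is only a minimal sketch: the helper name na_to_none is hypothetical (it is not part of Awkward or pandas), and the stand-in sentinel class exists only so that the snippet still runs where pandas is not installed.

```python
try:
    import pandas
    NA = pandas.NA
except ImportError:
    # Stand-in sentinel (assumption) so the sketch runs without pandas.
    class _StandInNAType:
        def __repr__(self):
            return "<NA>"

    NA = _StandInNAType()

NA_TYPE = type(NA)


def na_to_none(obj):
    """Recursively replace NA with None in nested lists, tuples, and dicts."""
    if isinstance(obj, NA_TYPE):
        return None
    if isinstance(obj, (list, tuple)):
        # Tuples are returned as lists to keep the sketch simple.
        return [na_to_none(x) for x in obj]
    if isinstance(obj, dict):
        return {key: na_to_none(value) for key, value in obj.items()}
    return obj


cleaned = na_to_none(["a", "b", NA])
print(cleaned)  # ['a', 'b', None]
```

Passing cleaned to ak.Array should then produce an option-type array of strings instead of raising the ValueError above.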

jpivarski · 2024-01-11T17:28:40Z

We talked about this in our meeting. I was worried about having too many different ways of expressing "missing value," since Awkward has option-types and Pandas has NAType. (I have the same concern about Pandas having both NAType and floating-point NaN, but that exists for historical reasons.)

Option 1: don't let NAType work with awkward-pandas and rely on Awkward's None, but make Pandas recognize None as missing in all the functions that do something special for missing values. (Examples include fillna, ffill, bfill, ...)
Option 2: make NAType work with awkward-pandas and have two ways of expressing "missing value," but Pandas only recognizes NAType in its functions. We would just have to explain the difference whenever it comes up.
Option 3: make NAType work with awkward-pandas and auto-convert Awkward Arrays when they get inserted into Awkward Series: the conversion would remove any option-type at the top level of an Awkward type tree and replace it with NAType, outside of the Awkward type tree.

Option 1 would involve a lot of work, because Pandas has many functions that do special things with missing values, and the Pandas API might not allow it. (They might not have a hook for us to tell Pandas, "these values are missing.")

Option 2 would probably be confusing for users.

That's why I would vote for option 3. The option-type removal and replacement with NAType could perhaps happen in the awkward_pandas.AwkwardExtensionArray constructor. It would only check for top-level nullability, which can come in two ways:

layout.is_option

and

layout.is_union and any(x.is_option for x in layout.contents)

(Actually, I think the policy for option-type and union-type is that all of the union's contents are option-type, but only one of them is non-trivially so—not UnmaskedArray. So in the above, any could be replaced with all and it would still work. But any short-circuits as soon as it sees one option-type.)
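
The two checks above could be combined into a single predicate. This is only a sketch: the function name has_top_level_option is hypothetical, and the SimpleNamespace objects are stand-ins that mimic just the is_option, is_union, and contents attributes of a layout, so the snippet runs without Awkward installed.

```python
from types import SimpleNamespace


def has_top_level_option(layout):
    """Return True if missing values can occur at the top level of the layout."""
    if layout.is_option:
        return True
    # any() short-circuits on the first option-type content of the union.
    return layout.is_union and any(x.is_option for x in layout.contents)


# Stand-in layouts (assumption): only the attributes used above are modeled.
plain = SimpleNamespace(is_option=False, is_union=False)
option = SimpleNamespace(is_option=True, is_union=False)
union_of_options = SimpleNamespace(
    is_option=False, is_union=True, contents=[option, option]
)
union_of_plain = SimpleNamespace(
    is_option=False, is_union=True, contents=[plain, plain]
)

print(has_top_level_option(option))            # True
print(has_top_level_option(union_of_options))  # True
print(has_top_level_option(union_of_plain))    # False
```

With a real array, the same function would be called on its layout, for example has_top_level_option(array.layout).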

The code that strips off the option-types would have to preserve indexes, and the project method does not preserve indexes. If the Pandas implementation of NAType is like a mask, then BitMaskedArray and ByteMaskedArray can just be replaced with their content and their mask can become the NAType's mask (converting bits to bytes in the case of BitMaskedArray, if it needs it). If NAType has an equivalent of IndexedOptionArray, then that can also be easily converted; if not, then the IndexedOptionArray can be converted to a masked array with to_ByteMaskedArray. The only other option-type node is UnmaskedArray, and that one's trivial (generate an empty mask if need be).
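
To make the index-preserving conversions concrete, here is a rough pure-Python sketch. All function names are hypothetical. It assumes the target mask holds one boolean per element with True meaning "missing", that a bit mask is interpreted through BitMaskedArray's valid_when and lsb_order parameters, and that a negative entry in an IndexedOptionArray index marks a missing value. A real implementation would work on NumPy buffers rather than Python lists.

```python
def bitmask_to_missing(mask_bytes, length, valid_when=True, lsb_order=True):
    """Expand a packed bit mask into one boolean per element (True = missing)."""
    missing = []
    for i in range(length):
        byte = mask_bytes[i // 8]
        # Position of element i within its byte depends on the bit order.
        shift = i % 8 if lsb_order else 7 - i % 8
        bit = (byte >> shift) & 1
        missing.append(bool(bit) != valid_when)
    return missing


def index_to_missing(index):
    """Negative index entries mark missing values."""
    return [i < 0 for i in index]


def take_preserving_length(content, index, placeholder=None):
    """Gather content by index, keeping one output slot per input row."""
    return [content[i] if i >= 0 else placeholder for i in index]


print(bitmask_to_missing([0b00000101], 3))   # [False, True, False]
print(index_to_missing([0, -1, 1]))          # [False, True, False]
print(take_preserving_length([1.1, 3.3], [0, -1, 1], placeholder=0.0))
# [1.1, 0.0, 3.3]
```

The last call keeps three slots for three rows, so the result stays aligned with the Pandas index. Projecting the same data would return only the two valid values, which is why it cannot be used here.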

If the option-types are inside of a union, then this becomes more complicated for the one non-UnmaskedArray in the union's contents. I'm sure it's possible, but it would be complicated to describe in words here.

Finally, I think this discussion should move to the awkward-pandas library. I tried to transfer the issue, but I don't have permissions. @douglasdavis, do you have permissions? (I think one person would need to have Admin permissions on both repos.) If not, could you give me the permissions on awkward-pandas so that I can move this sort of issue in the future?

douglasdavis · 2024-01-12T04:09:33Z

You should now have permission!

xinyuejohn · 2024-01-12T13:49:52Z

@jpivarski thanks for your reply! I think option 3 is indeed more user-friendly!

jpivarski · 2024-01-12T14:37:47Z

> You should now have permission!

Odd. It still doesn't work.

Maybe this is the reason:

Note: You can only transfer issues between repositories owned by the same user or organization account. A private repository issue cannot be transferred to a public repository.

Okay, I'll do it manually.

jpivarski · 2024-01-12T14:40:50Z

The issue has been moved to intake/akimbo#44, so I'm closing it here.

xinyuejohn added the "feature" (New feature or request) label on Jan 10, 2024

jpivarski mentioned this issue Jan 12, 2024

Support for pandas NAType intake/akimbo#44

Open

jpivarski closed this as completed Jan 12, 2024
