Support for pandas NAType #2931

Closed
xinyuejohn opened this issue Jan 10, 2024 · 6 comments
Labels
feature New feature or request

Comments

@xinyuejohn

Description of new feature

Hi, I was creating an Awkward Array from a pandas DataFrame and found that Awkward doesn't support pandas._libs.missing.NAType().

It would be great if this NAType could be supported.

To replicate:

import awkward as ak
import pandas

a = ["a", "b", pandas._libs.missing.NAType()]
ak.Array(a)

Traceback:

ValueError: cannot convert <NA> (type NAType) to an array element

(https://github.com/scikit-hep/awkward/blob/awkward-cpp-26/awkward-cpp/src/python/content.cpp#L191)

This error occurred while calling

    ak.to_layout(
        ['a', 'b', <NA>]
        allow_record = False
        regulararray = False
        primitive_policy = 'error'
    )
@xinyuejohn xinyuejohn added the feature New feature or request label Jan 10, 2024
@ianna
Collaborator

ianna commented Jan 10, 2024

@xinyuejohn - thanks for reporting the issue! Looking at the Pandas NA type documentation, I think NAType should be converted to Python None.

@jpivarski
Member

In our meeting, we talked about this. I was worried about having too many different ways of expressing "missing value," since Awkward has option-types and Pandas has NAType. (I wonder the same thing about Pandas having both NAType and floating-point NaN, but that's because of historical reasons.)

  • Option 1: don't let NAType work with awkward-pandas and rely on Awkward's None, but make Pandas recognize None as missing in all the functions that do something special for missing values. (Examples include fillna, ffill, bfill, ...)
  • Option 2: make NAType work with awkward-pandas and have two ways of expressing "missing value," but Pandas only recognizes NAType in its functions. We would just have to explain the difference whenever it comes up.
  • Option 3: make NAType work with awkward-pandas and auto-convert Awkward Arrays when they get inserted into Awkward Series: the conversion would remove any option-type at the top level of an Awkward type tree and replace it with NAType, outside of the Awkward type tree.

Option 1 would involve a lot of work, and the Pandas API might not allow it. (They might not have a hook for us to tell Pandas, "these values are missing.") It would be a lot of work because Pandas has a lot of functions that do special things with missing values.
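
For context, this is the kind of behavior those functions already have for NAType in a plain Pandas nullable column. It is a small illustration using an ordinary Int64 Series, not awkward-pandas, so it says nothing about what hooks an extension array would get:

```python
import pandas as pd

# A nullable-integer Series; the middle entry is pandas.NA.
s = pd.Series([1, pd.NA, 3], dtype="Int64")

print(s.isna().tolist())     # [False, True, False]
print(s.fillna(0).tolist())  # [1, 0, 3]
print(s.ffill().tolist())    # [1, 1, 3]
```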

Option 2 would probably be confusing for users.

That's why I would vote for option 3. The option-type removal and replacement with NAType could perhaps happen in the awkward_pandas.AwkwardExtensionArray constructor. It would only check for top-level nullability, which can come in two ways:

layout.is_option

and

layout.is_union and any(x.is_option for x in layout.contents)

(Actually, I think the policy for option-type and union-type is that all of the union's contents are option-type, but only one of them is non-trivially so—not UnmaskedArray. So in the above, any could be replaced with all and it would still work. But any short-circuits as soon as it sees one option-type.)

The code that strips off the option-types would have to preserve indexes, and the project method does not preserve indexes. If the Pandas implementation of NAType is like a mask, then BitMaskedArray and ByteMaskedArray can just be replaced with their content and their mask can become the NAType's mask (converting bits to bytes in the case of BitMaskedArray, if it needs it). If NAType has an equivalent of IndexedOptionArray, then that can also be easily converted; if not, then the IndexedOptionArray can be converted to a masked array with to_ByteMaskedArray. The only other option-type node is UnmaskedArray, and that one's trivial (generate an empty mask if need be).
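
To make the mask handling concrete, here is a rough sketch that uses only NumPy and Pandas. It assumes the target is a Pandas nullable array such as `pandas.arrays.IntegerArray`, which is built from a values array plus a boolean mask in which True means missing. The helper names are hypothetical, the raw arrays stand in for the buffers of the Awkward nodes, and the index-based helper assumes a non-empty content.

```python
import numpy as np
import pandas as pd


def bits_to_missing_mask(bitmask, length, valid_when=True, lsb_order=True):
    """Unpack a packed bit mask into a boolean mask where True means missing."""
    bits = np.unpackbits(
        np.asarray(bitmask, dtype=np.uint8),
        bitorder="little" if lsb_order else "big",
    )[:length]
    # A bit equal to valid_when marks a valid entry; everything else is missing.
    return bits.astype(bool) != valid_when


def index_to_data_and_mask(index, content):
    """Expand an option index (negative means missing) into same-length data and mask."""
    index = np.asarray(index)
    missing = index < 0
    # Missing rows read a placeholder from position 0, so every row keeps its place.
    data = np.asarray(content)[np.where(missing, 0, index)]
    return data, missing


# Bit-masked case: the content is kept as-is, the bits become a byte mask.
content = np.array([10, 20, 30], dtype=np.int64)
from_bits = pd.arrays.IntegerArray(content, bits_to_missing_mask([0b00000101], length=3))
print(from_bits.isna().tolist())  # [False, True, False]

# Index-based case: the content is shorter than the array, so it is expanded first.
data, missing = index_to_data_and_mask([0, -1, 1], np.array([10, 20], dtype=np.int64))
from_index = pd.arrays.IntegerArray(data, missing)
print(from_index.isna().tolist())  # [False, True, False]
```

In both cases the output has one entry per original row, which is the index-preserving property that project lacks.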

If the option-types are inside of a union, then this becomes more complicated for the one non-UnmaskedArray in the union's contents. I'm sure it's possible, but it would be complicated to describe in words here.


Finally, I think this discussion should move to the awkward-pandas library. I tried to transfer the issue, but I don't have permissions. @douglasdavis, do you have permissions? (I think one person would need to have Admin permissions on both repos.) If not, could you give me the permissions on awkward-pandas so that I can move this sort of issue in the future?

@douglasdavis
Contributor

douglasdavis commented Jan 12, 2024

You should now have permission!

@xinyuejohn
Author

@jpivarski thanks for your reply! I think option 3 is indeed more user-friendly!

@jpivarski
Member

> You should now have permission!

Odd. It still doesn't work.


Maybe this is the reason:

Note: You can only transfer issues between repositories owned by the same user or organization account. A private repository issue cannot be transferred to a public repository.

Okay, I'll do it manually.

@jpivarski
Member

The issue has been moved to intake/akimbo#44, so I'm closing it here.
