Support for pandas Extension Arrays #5287

Closed
Hoeze opened this issue May 10, 2021 · 8 comments · Fixed by #8723
Labels
topic-arrays related to flexible array support

Comments

@Hoeze

Hoeze commented May 10, 2021

Is your feature request related to a problem? Please describe.
I started writing an ExtensionArray which is basically a Tuple[Array[str], Array[int], Array[int], Array[str], Array[str]].
Its scalar type is a Tuple[str, int, int, str, str].

This works great in pandas: I can read and write Parquet as well as CSV with it.
However, as soon as I call any .to_xarray() method, it gets converted to a NumPy array of objects.
Also, converting back to pandas yields a Series of objects instead of my extension type.
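
The loss of the extension type can be reproduced without xarray, under the assumption that the conversion goes through plain NumPy coercion (as `np.asarray` does). This is only a sketch: `Categorical` is a built-in extension array used as a stand-in for the custom tuple-like array described above, and the values are made up.

```python
import numpy as np
import pandas as pd

# Stand-in for the custom ExtensionArray: Categorical is a built-in one.
original = pd.Series(pd.Categorical(["chr1", "chr2", "chr1"]))

# Coercing to NumPy drops the extension dtype and yields a plain ndarray.
as_numpy = np.asarray(original)

# Rebuilding a Series from that ndarray does not restore the extension dtype.
roundtrip = pd.Series(as_numpy)

print("original dtype: ", original.dtype)
print("after np.asarray:", type(as_numpy).__name__, as_numpy.dtype)
print("round trip dtype:", roundtrip.dtype)
```

The values survive the round trip, but the information that they belonged to an extension type does not.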

Describe the solution you'd like
Would it be possible to support Pandas Extension Types on coordinates?
It's not necessary to compute anything on them, I'd just like to use them for dimensions.

Describe alternatives you've considered
I was thinking about implementing a NumPy duck array, but I have never tried this and it looks quite complicated compared to the pandas extension types.

@keewis
Collaborator

keewis commented May 10, 2021

I think I remember reading somewhere that we want to stay compatible with NumPy, which means that we're waiting for NEP 40-43 to be included in a release. As far as I can tell that might still take a while, though; the implementation is not quite there yet.

Edit: in any case, I think something like that would be really useful

@max-sixty
Collaborator

If there were sufficient demand and development effort for pandas extension arrays, I think there'd be interest in adding it without waiting for NumPy, similar to how we handle dask / sparse arrays.

But I imagine it would be a decently sized project, and AFAIK no one from the existing core dev team has expressed interest in taking it on, so it would have to come from others. And it's probably a convex project that's only useful once it's completed — rather than marginally helpful with marginal improvements.

@dcherian
Contributor

If they added NEP-18 support, many things would work automatically, wouldn't it?

xref pandas-dev/pandas#26380

Unfortunately, pandas-dev/pandas#35032 was closed.

@shoyer
Member

shoyer commented May 11, 2021

> If they added NEP-18 support, many things would work automatically, wouldn't it?

In my opinion, NEP-18 support is probably out of scope for pandas.

But this would totally make sense for a separate mini-project, to make a NumPy compatible wrapper of pandas extension arrays.

I see two possible levels of support here:

  1. Only 1D data, with NumPy's API. Operations that would produce multi-dimensional data raise an error.
  2. Support N-D data, on top of pandas' 1D API. This would make extension arrays more generally useful in Xarray, but some operations might be hard to do efficiently.
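
A minimal sketch of what the first level might look like, assuming NEP-18 dispatch through `__array_function__`. `ExtensionArray1D`, `implements` and `_HANDLED` are hypothetical names invented for illustration; they are not part of pandas, NumPy or xarray. Only `np.concatenate` is handled here; any other function, including those that would produce multi-dimensional data, is declined.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

_HANDLED = {}  # NumPy function -> implementation for the wrapper


def implements(numpy_function):
    """Register an implementation of a NumPy function for the wrapper."""
    def decorator(impl):
        _HANDLED[numpy_function] = impl
        return impl
    return decorator


class ExtensionArray1D:
    """Hypothetical 1D-only wrapper around a pandas ExtensionArray."""

    def __init__(self, values: ExtensionArray):
        self.values = values

    @property
    def dtype(self):
        return self.values.dtype

    @property
    def ndim(self):
        return 1

    @property
    def shape(self):
        return (len(self.values),)

    def __array_function__(self, func, types, args, kwargs):
        # Declining makes NumPy raise TypeError for unsupported functions.
        if func not in _HANDLED:
            return NotImplemented
        return _HANDLED[func](*args, **kwargs)


@implements(np.concatenate)
def _concatenate(arrays, axis=0):
    if axis != 0:
        raise ValueError("1D data can only be concatenated along axis 0")
    pieces = [pd.Series(a.values) for a in arrays]
    return ExtensionArray1D(pd.concat(pieces, ignore_index=True).array)


a = ExtensionArray1D(pd.array([1, 2], dtype="Int64"))
b = ExtensionArray1D(pd.array([3, None], dtype="Int64"))

joined = np.concatenate([a, b])
print(joined.dtype, joined.shape)

try:
    np.stack([a, b])  # would need a 2D result
except TypeError as err:
    print("np.stack rejected:", err)
```

The second level would need the wrapper to track an N-D shape on top of the flat 1D storage, which is where the efficiency concerns come in.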

@jbrockmendel

> Unfortunately, pandas-dev/pandas#35032 was closed

I'm hoping to re-open at some point. The trouble I ran into is that a) there isn't any way to implement __array_function__ incrementally and b) there aren't any assurances on where self is among the args and kwargs passed to __array_function__. The workarounds I came up with for the latter were pretty ugly. Input would be welcome.
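
To illustrate point b), here is a small probe (a sketch, not the code from that PR). `Probe` and `locate` are hypothetical names; the class does nothing except report where the instance shows up in the `args` and `kwargs` that NumPy hands to `__array_function__`.

```python
import numpy as np


def locate(obj, args, kwargs):
    """Describe where obj appears in a call's args and kwargs."""
    for i, arg in enumerate(args):
        if arg is obj:
            return f"args[{i}]"
        if isinstance(arg, (list, tuple)) and any(item is obj for item in arg):
            return f"inside args[{i}]"
    for key, value in kwargs.items():
        if value is obj:
            return f"kwargs[{key!r}]"
    return "not found"


class Probe:
    """Hypothetical duck array that only reports how NumPy called it."""

    def __array_function__(self, func, types, args, kwargs):
        return locate(self, args, kwargs)


p = Probe()

print(np.sum(p))                       # first positional argument
print(np.sum(a=p))                     # passed by keyword
print(np.where([True, False], 1, p))   # third positional argument
print(np.concatenate([p, p]))          # nested inside a sequence
```

Each call reaches the same method, but `self` turns up in a different place every time, so a generic implementation has to search for it.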

Keep in mind that the PR implemented __array_function__ for NDArrayBackedExtensionArray (which includes DatetimeArray, TimedeltaArray, PeriodArray, and Categorical; I expect most 3rd-party EAs will be natural candidates). Implementing it on the base ExtensionArray class would be a different animal.

> Support N-D data, on top of pandas' 1D API. This would make extension arrays more generally useful in Xarray, but some operations might be hard to do efficiently

ATM NDArrayBackedExtensionArray explicitly supports 2D, and because it is a thin wrapper around np.ndarray, higher dimensions should either work or be within spitting distance of working.

I'm trying to get support for 2D more generally (xref pandas-dev/pandas#38992), but at best it will be a while before that becomes a reality.

@ivirshup

Sorry for the necrobump (let me know if I should comment elsewhere), but should the target here now be "some level of support for the array-api"?

@dcherian
Contributor

Yes!

@jbrockmendel

ExtensionArrays are orthogonal to the array-api

7 participants