Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(feat): Support for pandas ExtensionArray #8723

Merged
merged 101 commits into from Apr 18, 2024

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Feb 8, 2024

Some outstanding points/decisions brought up by this PR:

  • Confirm type promotion rules and write them out. As it stands now, if everything is of the same extension array type, it is passed onwards and otherwise is converted to numpy. (related: Avoid coercing to numpy in as_shared_dtypes #8714)
    - [ ] Acceptance of plum as a dispatch method. Without it, the behavior should be fallen back on from before (cast to numpy types). I am a big fan of dispatching and think it could serve as a model going forward for making support of other data types/arrays more feasible. The other option, I think, would be to just use the underlying array of the ExtensionDuckArray class to decide and then have some central registry that serves as the basis for a decorator (like the api for accessors via _CachedAccessor). That being said, the current defaults are quite good so this is a marginal feature, in all likelihood.
  • Do we allow just pandas ExtensionArray directly or can we also allow Series?

Possible missing something else! Let me know!

Checklist:

Copy link

welcome bot commented Feb 8, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@TomNicholas
Copy link
Contributor

This looks like a big feature addition!

I'm very ignorant of ExtensionArrays - but is it possible to imagine a design where ExtensionDuckArray conforms well enough to the duck array structure that xarray expects that ExtensionDuckArray could live entirely outside of xarray? I'm also curious why we need isinstance(data, ExtensionDuckArray rather than calling our already-existing is_duck_array(data).

@TomNicholas TomNicholas added the topic-arrays related to flexible array support label Feb 8, 2024
@TomNicholas TomNicholas added this to In progress in Duck Array Wrapping via automation Feb 8, 2024
@ilan-gold
Copy link
Contributor Author

What's the status here?

xarray/tests/__init__.py Outdated Show resolved Hide resolved
@dcherian dcherian added the plan to merge Final call for comments label Apr 13, 2024
@dcherian
Copy link
Contributor

Sorry, this fell off my radar. Can you open an issue regarding the to_dataframe problem?

@dcherian dcherian enabled auto-merge (squash) April 16, 2024 15:10
@ilan-gold ilan-gold disabled auto-merge April 17, 2024 08:48
@dcherian dcherian merged commit 9eb180b into pydata:main Apr 18, 2024
31 checks passed
Duck Array Wrapping automation moved this from In progress to Done Apr 18, 2024
Copy link

welcome bot commented Apr 18, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

dcherian added a commit to djhoese/xarray that referenced this pull request Apr 18, 2024
* main:
  (feat): Support for `pandas` `ExtensionArray` (pydata#8723)
  Migrate datatree mapping.py (pydata#8948)
  Add mypy to dev dependencies (pydata#8947)
  Convert 360_day calendars by choosing random dates to drop or add (pydata#8603)
Comment on lines +236 to +237
elif array_type_cupy := array_type("cupy") and any( # noqa: F841
isinstance(x, array_type_cupy) for x in scalars_or_arrays # noqa: F821
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this kind of syntax allowed?

I suspect the CI didn't run this code:

$ pytest xarray/tests/test_array_api.py -x
================================= test session starts =================================
platform linux -- Python 3.10.14, pytest-8.2.0, pluggy-1.5.0
rootdir: /home/mark/git/xarray
configfile: pyproject.toml
collected 13 items

xarray/tests/test_array_api.py .F

====================================== FAILURES =======================================
__________________________________ test_aggregation ___________________________________

arrays = (<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 1.,  2.,  3.],
       [ 4.,  5., nan]])
Coordinates:
  * x        ...      [ 4.,  5., nan]], dtype=float64)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y)

    def test_aggregation(arrays: tuple[xr.DataArray, xr.DataArray]) -> None:
        np_arr, xp_arr = arrays
>       expected = np_arr.sum()

/home/mark/git/xarray/xarray/tests/test_array_api.py:51:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/mark/git/xarray/xarray/core/_aggregations.py:1857: in sum
    return self.reduce(
/home/mark/git/xarray/xarray/core/dataarray.py:3805: in reduce
    var = self.variable.reduce(func, dim, axis, keep_attrs, keepdims, **kwargs)
/home/mark/git/xarray/xarray/core/variable.py:1663: in reduce
    result = super().reduce(
/home/mark/git/xarray/xarray/namedarray/core.py:889: in reduce
    data = func(self.data, **kwargs)
/home/mark/git/xarray/xarray/core/duck_array_ops.py:427: in f
    return func(values, axis=axis, **kwargs)
/home/mark/git/xarray/xarray/core/nanops.py:99: in nansum
    result = sum_where(a, axis=axis, dtype=dtype, where=mask)
/home/mark/git/xarray/xarray/core/duck_array_ops.py:333: in sum_where
    a = where_method(xp.zeros_like(data), where, data)
/home/mark/git/xarray/xarray/core/duck_array_ops.py:349: in where_method
    return where(cond, data, other)
/home/mark/git/xarray/xarray/core/duck_array_ops.py:343: in where
    return xp.where(condition, *as_shared_dtype([x, y], xp=xp))
/home/mark/git/xarray/xarray/core/duck_array_ops.py:236: in as_shared_dtype
    elif array_type_cupy := array_type("cupy") and any(  # noqa: F841
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <list_iterator object at 0x7e404174b6d0>

    elif array_type_cupy := array_type("cupy") and any(  # noqa: F841
>       isinstance(x, array_type_cupy) for x in scalars_or_arrays  # noqa: F821
    ):
E   NameError: free variable 'array_type_cupy' referenced before assignment in enclosing scope

/home/mark/git/xarray/xarray/core/duck_array_ops.py:237: NameError
=============================== short test summary info ===============================
FAILED xarray/tests/test_array_api.py::test_aggregation - NameError: free variable 'array_type_cupy' referenced before assignment in enclosi...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
============================= 1 failed, 1 passed in 0.33s =============================

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmaarrfk Interesting. I see you fixed this. I must have done this when testing because I do specifically remember testing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues. It was a straightforward fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments topic-arrays related to flexible array support
Projects
Development

Successfully merging this pull request may close these issues.

Categorical Array Support for pandas Extension Arrays
6 participants