
Avoid coercing to numpy in as_shared_dtypes #8714

Open · wants to merge 12 commits into base: main
2 changes: 2 additions & 0 deletions doc/whats-new.rst

@@ -53,6 +53,8 @@ Bug fixes
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Ensure :py:meth:`DataArray.unstack` works when wrapping array API-compliant classes. (:issue:`8666`, :pull:`8668`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Avoid coercing to numpy arrays inside :py:func:`~xarray.core.duck_array_ops.as_shared_dtype`. (:pull:`8714`)
By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Preserve chunks when writing time-like variables to zarr by enabling lazy CF
encoding of time-like variables (:issue:`7132`, :issue:`8230`, :issue:`8432`,
:pull:`8575`). By `Spencer Clark <https://github.com/spencerkclark>`_ and
22 changes: 5 additions & 17 deletions xarray/core/duck_array_ops.py

@@ -36,7 +36,7 @@
from xarray.core import dask_array_ops, dtypes, nputils, pycompat
from xarray.core.options import OPTIONS
from xarray.core.parallelcompat import get_chunked_array_type, is_chunked_array
from xarray.core.pycompat import array_type, is_duck_dask_array
from xarray.core.pycompat import is_duck_dask_array, to_duck_array
from xarray.core.utils import is_duck_array, module_available

# remove once numpy 2.0 is the oldest supported version
@@ -219,22 +219,10 @@ def asarray(data, xp=np):


def as_shared_dtype(scalars_or_arrays, xp=np):
"""Cast a arrays to a shared dtype using xarray's type promotion rules."""
array_type_cupy = array_type("cupy")
if array_type_cupy and any(
isinstance(x, array_type_cupy) for x in scalars_or_arrays
):
import cupy as cp

arrays = [asarray(x, xp=cp) for x in scalars_or_arrays]
else:
arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
TomNicholas (Member, Author) commented:

Previously this asarray call would coerce to numpy unnecessarily, when all we really wanted was an array type that we could examine the .dtype attribute of.

# Pass arrays directly instead of dtypes to result_type so scalars
# get handled properly.
# Note that result_type() safely gets the dtype from dask arrays without
# evaluating them.
out_type = dtypes.result_type(*arrays)
return [astype(x, out_type, copy=False) for x in arrays]
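The unwanted coercion described in the comment above can be demonstrated with a minimal, hypothetical duck array (`LazyArray` below is illustrative, not an xarray class): reading `.dtype` is cheap, while `np.asarray` calls `__array__` and materializes the values.

```python
import numpy as np

class LazyArray:
    """Hypothetical duck array that records when it is coerced to numpy."""

    def __init__(self, value):
        self._value = list(value)
        self.coerced = False
        # dtype can be known without materializing the (conceptually lazy) data
        self.dtype = np.asarray(value).dtype

    def __array__(self, dtype=None, copy=None):
        # np.asarray lands here, forcing materialization
        self.coerced = True
        return np.asarray(self._value, dtype=dtype)

lazy = LazyArray([1, 2, 3])

_ = lazy.dtype          # inspecting .dtype does not coerce...
assert not lazy.coerced

np.asarray(lazy)        # ...but asarray (what the old code effectively did) does
assert lazy.coerced
```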
"""Cast arrays to a shared dtype using xarray's type promotion rules."""
duckarrays = [to_duck_array(obj, xp=xp) for obj in scalars_or_arrays]
dcherian (Contributor) commented on Feb 6, 2024:

This is fine but will force a read from disk. We could add a dtype property that forwards to the underlying array.dtype

EDIT: I don't think my comment is right, since we expect to return duck arrays here, it's ok to just read from disk and create that duck array.

It will get more complicated when we do lazy concatenation in Xarray, then we'd need to lazily infer dtypes and apply a lazy astype.

TomNicholas (Member, Author) replied on Feb 6, 2024:

When you say "read from disk" do you mean calling the `__array__` method of the innermost duck array? Because that's what I'm trying to avoid.

EDIT: Or do you mean resolving all these wrapper layers (either by calling `__array__` or `get_duck_array()`)?

dcherian (Contributor) replied:

> do you mean calling the `__array__` method of the innermost duck array?

I think our naming convention is that a "duck array" is a "computational array", e.g. numpy or dask, but NOT our explicitly-indexed array classes. The latter wrap duck arrays.

Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

(related: zarr-developers/zarr-python#1603 (comment))

PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs
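The layering described above — an outer ExplicitlyIndexed-style wrapper whose `get_duck_array()` propagates down to a BackendArray-style reader, which is the only layer that touches the bytes — can be sketched with stand-in classes (`FakeBackendArray` and `FakeLazilyIndexedArray` are hypothetical names, not xarray's own):

```python
import numpy as np

class FakeBackendArray:
    """Stands in for a BackendArray: the only layer that touches the bytes."""

    def __init__(self, values):
        self._values = values
        self.reads = 0

    def get_duck_array(self):
        self.reads += 1  # a real backend would read bytes from disk here
        return np.asarray(self._values)

class FakeLazilyIndexedArray:
    """Stands in for an ExplicitlyIndexed wrapper around the backend array."""

    def __init__(self, array):
        self.array = array

    def get_duck_array(self):
        # propagate the request down to the innermost layer
        return self.array.get_duck_array()

backend = FakeBackendArray([1.0, 2.0])
wrapped = FakeLazilyIndexedArray(backend)

assert backend.reads == 0        # nothing has been read yet
duck = wrapped.get_duck_array()  # the "read from disk" happens here
assert backend.reads == 1
assert isinstance(duck, np.ndarray)
```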

TomNicholas (Member, Author) replied:

> Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

Yes, but my KerchunkArray case is interesting because I don't want to use BackendArray: I have no use for CopyOnWrite (I'm never loading bytes), nor for lazy indexing (I can't index into the KerchunkArray at all).

> PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs

Yeah that could be helpful actually :) I'm learning a lot right now about a part of xarray I have never had a reason to look at before!

dcherian (Contributor) replied:

> but my KerchunkArray case is interesting because I don't want to use BackendArray

Well then don't use the backend infrastructure? :P

TomNicholas (Member, Author) replied:

Haha yes yes

No, but seriously: I did think about that, and I do think it makes sense to use the backend infrastructure here. I could make my full case, but after all we are still reading from files here; we just aren't reading the bytes inside the chunks.

out_type = dtypes.result_type(*duckarrays)
return [astype(x, out_type, copy=False) for x in duckarrays]
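For intuition on the promotion step, plain numpy's `result_type` behaves analogously for this simple case (a sketch only — xarray's `dtypes.result_type` layers its own promotion rules on top):

```python
import numpy as np

a = np.array([1, 2], dtype="int32")

# Passing the array and the scalar directly lets the scalar participate in
# promotion, instead of first freezing it into an array of some fixed dtype.
out_type = np.result_type(a, 2.5)
assert out_type == np.dtype("float64")

# Cast everything to the shared dtype, mirroring the astype loop above.
promoted = [np.asarray(x).astype(out_type, copy=False) for x in (a, 2.5)]
assert all(p.dtype == out_type for p in promoted)
```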


def broadcast_to(array, shape):
19 changes: 14 additions & 5 deletions xarray/core/pycompat.py

@@ -7,6 +7,7 @@
import numpy as np
from packaging.version import Version

from xarray.core.types import T_DuckArray
from xarray.core.utils import is_duck_array, is_scalar, module_available

integer_types = (int, np.integer)
@@ -86,23 +87,23 @@ def mod_version(mod: ModType) -> Version:
return _get_cached_duck_array_module(mod).version


def is_dask_collection(x):
def is_dask_collection(x) -> bool:
if module_available("dask"):
from dask.base import is_dask_collection

return is_dask_collection(x)
return False


def is_duck_dask_array(x):
def is_duck_dask_array(x) -> bool:
return is_duck_array(x) and is_dask_collection(x)


def is_chunked_array(x) -> bool:
return is_duck_dask_array(x) or (is_duck_array(x) and hasattr(x, "chunks"))


def is_0d_dask_array(x):
def is_0d_dask_array(x) -> bool:
return is_duck_dask_array(x) and is_scalar(x)
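The chunked-array predicate above is duck-typed: anything array-like that exposes a `.chunks` attribute counts, whether or not it is dask. A toy illustration (`ToyChunked` and `is_chunked_sketch` are hypothetical names standing in for the real checks):

```python
import numpy as np

class ToyChunked:
    """Hypothetical array-like exposing a minimal duck-array surface plus .chunks."""

    def __init__(self, data, chunks):
        self._data = np.asarray(data)
        self.chunks = chunks

    @property
    def ndim(self):
        return self._data.ndim

    @property
    def shape(self):
        return self._data.shape

    @property
    def dtype(self):
        return self._data.dtype

def is_chunked_sketch(x) -> bool:
    # mirrors the predicate above: array-like attributes plus a .chunks attribute
    looks_like_array = all(hasattr(x, attr) for attr in ("ndim", "shape", "dtype"))
    return looks_like_array and hasattr(x, "chunks")

assert is_chunked_sketch(ToyChunked([1, 2, 3], chunks=((3,),)))
assert not is_chunked_sketch(np.arange(3))  # plain numpy: no .chunks
```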


@@ -129,12 +130,20 @@ def to_numpy(data) -> np.ndarray:
return data


def to_duck_array(data):
def to_duck_array(data, xp=np) -> T_DuckArray:
from xarray.core.indexing import ExplicitlyIndexed

if isinstance(data, ExplicitlyIndexed):
return data.get_duck_array()
elif is_duck_array(data):
return data
else:
return np.asarray(data)
from xarray.core.duck_array_ops import asarray
Contributor commented: Can we use the `to_numpy` in this file instead?


array_type_cupy = array_type("cupy")
if array_type_cupy and isinstance(data, array_type_cupy):
import cupy as cp

return asarray(data, xp=cp)
else:
return asarray(data, xp=xp)
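A standalone sketch of the dispatch above (names and checks are illustrative; the real `to_duck_array` uses `ExplicitlyIndexed` and `is_duck_array` rather than these `hasattr`/`isinstance` tests, and also special-cases cupy):

```python
import numpy as np

def to_duck_array_sketch(data, xp=np):
    """Illustrative dispatch mirroring to_duck_array."""
    if hasattr(data, "get_duck_array"):   # ExplicitlyIndexed-style wrapper
        return data.get_duck_array()
    if isinstance(data, np.ndarray):      # already a computational duck array
        return data
    return xp.asarray(data)               # scalars, lists, etc.

arr = np.arange(3)
assert to_duck_array_sketch(arr) is arr                  # passed through unchanged
assert isinstance(to_duck_array_sketch([1, 2, 3]), np.ndarray)
```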