Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chunked array docs #7951

Merged
merged 75 commits into from Jul 5, 2023
Merged
Show file tree
Hide file tree
Changes from 67 commits
Commits
Show all changes
75 commits
Select commit Hold shift + click to select a range
0217fe3
draft updates
TomNicholas Jun 12, 2023
5a221bb
discuss array API standard
TomNicholas Jun 12, 2023
1971da4
fix sparse examples so they run
TomNicholas Jun 13, 2023
fa58fff
Deepak's suggestions
TomNicholas Jun 14, 2023
258dd54
link to duck arrays user guide from internals page
TomNicholas Jun 14, 2023
b26e7ac
fix various links
TomNicholas Jun 15, 2023
ad81811
itemized list
TomNicholas Jun 15, 2023
99394a3
mention dispatching on functions not in the array API standard
TomNicholas Jun 15, 2023
c93f143
examples of duckarrays
TomNicholas Jun 21, 2023
b6279fd
add intended audience to xarray internals section
TomNicholas Jun 21, 2023
e0bd049
draft page on chunked arrays
TomNicholas Jun 21, 2023
0eea00b
move paragraph on why its called a duck array upwards
TomNicholas Jun 27, 2023
cc4fac0
delete section on numpy ufuncs
TomNicholas Jun 27, 2023
5e8015f
explain difference between .values and to_numpy
TomNicholas Jun 27, 2023
70bfda5
strongly prefer to_numpy over values
TomNicholas Jun 27, 2023
5fdb7e3
recommend to_numpy instead of values in the how do I? page
TomNicholas Jun 27, 2023
68315f8
clearer about using to_numpy
TomNicholas Jun 27, 2023
2931b86
merge section on missing features
TomNicholas Jun 27, 2023
9f21b00
remove todense from examples
TomNicholas Jun 27, 2023
2bb65d5
whatsnew
TomNicholas Jun 27, 2023
f0ba66c
Merge branch 'main' into duckarray-docs
TomNicholas Jun 27, 2023
0b405a1
double that
TomNicholas Jun 28, 2023
ed6195c
numpy array class clarification
TomNicholas Jun 28, 2023
40eb53b
Remove sentence about xarray's internals
TomNicholas Jun 28, 2023
a567aa4
array API standard
TomNicholas Jun 28, 2023
76237a9
proper link for sparse.COO type
TomNicholas Jun 28, 2023
1923d4b
links to docstrings of array types
TomNicholas Jun 28, 2023
b26cbd8
don't put variable in parentheses
TomNicholas Jun 28, 2023
f62b4a9
double backquote formatting
TomNicholas Jun 28, 2023
8d4bd3f
better bracketing
TomNicholas Jun 28, 2023
e9287de
fix list formatting
TomNicholas Jun 28, 2023
d1e9b8f
add links to glue packages, dask, and cubed
TomNicholas Jun 28, 2023
d545d5d
Merge branch 'duckarray-docs' of https://github.com/TomNicholas/xarra…
TomNicholas Jun 28, 2023
1ea2078
link to todense method
TomNicholas Jun 28, 2023
be919b6
link to numpy-like arrays page
TomNicholas Jun 28, 2023
0c0a547
Merge branch 'duckarray-docs' of https://github.com/TomNicholas/xarra…
TomNicholas Jun 28, 2023
d03e125
link to numpy ufunc docs
TomNicholas Jun 28, 2023
41f2d68
Merge branch 'duckarray-docs' into chunked-array-docs
TomNicholas Jun 28, 2023
c50f531
more text about chunkmanagers
TomNicholas Jun 28, 2023
90a8bcb
add example of using .to_numpy
TomNicholas Jun 28, 2023
9a086c5
note on ideally not having an entrypoint system
TomNicholas Jun 28, 2023
cfd1396
parallel processing without chunks
TomNicholas Jun 28, 2023
9dc63c0
explain the user interface
TomNicholas Jun 28, 2023
7bdd976
how to register the chunkmanager
TomNicholas Jun 28, 2023
14057b9
show example of .values failing
TomNicholas Jun 28, 2023
9fd2b41
Merge branch 'duckarray-docs' into chunked-array-docs
TomNicholas Jun 28, 2023
098e152
link from duck arrays page
TomNicholas Jun 28, 2023
c7b686a
whatsnew
TomNicholas Jun 28, 2023
45000e4
move whatsnew entry to unreleased version
TomNicholas Jun 28, 2023
80e9fa4
capitalization
TomNicholas Jun 28, 2023
da8719d
fix warning in docs build
TomNicholas Jun 28, 2023
a83460f
Merge branch 'duckarray-docs' into chunked-array-docs
TomNicholas Jun 28, 2023
5d4496e
Merge branch 'main' into chunked-array-docs
TomNicholas Jun 29, 2023
6c1a422
fix a bunch of links
TomNicholas Jul 1, 2023
2a40763
display API of ChunkManagerEntrypoint class attributes and methods
TomNicholas Jul 4, 2023
15505fa
improve docstrings in ABC
TomNicholas Jul 4, 2023
1bf55d3
add cubed to intersphinx mapping
TomNicholas Jul 4, 2023
04a7a8e
link to dask.array as module not class
TomNicholas Jul 4, 2023
8b74e89
typo
TomNicholas Jul 5, 2023
cab76bf
fix bold formatting
TomNicholas Jul 5, 2023
d120fab
proper docstrings
TomNicholas Jul 5, 2023
d619a71
Merge branch 'main' into chunked-array-docs
TomNicholas Jul 5, 2023
039b9d7
mention from_array specifically and link to requirements section of d…
TomNicholas Jul 5, 2023
d6c3cba
add explicit link to cubed
TomNicholas Jul 5, 2023
7386f66
mention ramba and arkouda
TomNicholas Jul 5, 2023
325254d
Merge branch 'main' into chunked-array-docs
dcherian Jul 5, 2023
865f4c7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 5, 2023
c8c1c20
py:mod
TomNicholas Jul 5, 2023
c975741
Merge branch 'chunked-array-docs' of https://github.com/TomNicholas/x…
TomNicholas Jul 5, 2023
cfe5540
Merge branch 'main' into chunked-array-docs
TomNicholas Jul 5, 2023
6d7b68b
Present tense regarding wrapping cubed
TomNicholas Jul 5, 2023
f759cdf
Merge branch 'chunked-array-docs' of https://github.com/TomNicholas/x…
TomNicholas Jul 5, 2023
317a35f
add links to cubed
TomNicholas Jul 5, 2023
71238ce
add references for numpy links in apply_gufunc docstring
TomNicholas Jul 5, 2023
a66c25a
fix some broken links to docstrings
TomNicholas Jul 5, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/conf.py
Expand Up @@ -323,6 +323,7 @@
"dask": ("https://docs.dask.org/en/latest", None),
"cftime": ("https://unidata.github.io/cftime", None),
"sparse": ("https://sparse.pydata.org/en/latest/", None),
"cubed": ("https://tom-e-white.com/cubed/", None),
}


Expand Down
101 changes: 101 additions & 0 deletions doc/internals/chunked-arrays.rst
@@ -0,0 +1,101 @@
.. currentmodule:: xarray

.. _internals.chunkedarrays:

Alternative chunked array types
===============================

.. warning::

This is a *highly* experimental feature. Please report any bugs or other difficulties on `xarray's issue tracker <https://github.com/pydata/xarray/issues>`_.
In particular see discussion on `xarray issue #6807 <https://github.com/pydata/xarray/issues/6807>`_

Xarray can wrap chunked dask arrays (see :ref:`dask`), but can also wrap any other chunked array type that exposes the correct interface.
This allows us to support using other frameworks for distributed and out-of-core processing, with user code still written as xarray commands.
TomNicholas marked this conversation as resolved.
Show resolved Hide resolved
In particular xarray can now also supports wrapping :py:class:`cubed.Array` objects
TomNicholas marked this conversation as resolved.
Show resolved Hide resolved
(see `Cubed's documentation <https://tom-e-white.com/cubed/>`_ and the `cubed-xarray package <https://github.com/xarray-contrib/cubed-xarray>`_).

The basic idea is that by wrapping an array that has an explicit notion of ``.chunks``, xarray can expose control over
the choice of chunking scheme to users via methods like :py:meth:`DataArray.chunk` whilst the wrapped array actually
implements the handling of processing all of the chunks.

Chunked array methods and "core operations"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A chunked array needs to meet all the :ref:`requirements for normal duck arrays <internals.duckarrays.requirements>`, but must also
implement additional features.

Chunked arrays have additional attributes and methods, such as ``.chunks`` and ``.rechunk``.
Furthermore, Xarray dispatches chunk-aware computations across one or more chunked arrays using special functions known
as "core operations". Examples include ``map_blocks``, ``blockwise``, and ``apply_gufunc``.

The core operations are generalizations of functions first implemented in :py:module:`dask.array`.
The implementation of these functions is specific to the type of arrays passed to them. For example, when applying the
``map_blocks`` core operation, :py:class:`dask.array.Array` objects must be processed by :py:func:`dask.array.map_blocks`,
whereas :py:class:`cubed.Array` objects must be processed by :py:func:`cubed.map_blocks`.

In order to use the correct implementation of a core operation for the array type encountered, xarray dispatches to the
corresponding subclass of :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint`,
also known as a "Chunk Manager". Therefore **a full list of the operations that need to be defined is set by the
API of the** :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint` **abstract base class**. Note that chunked array
methods are also currently dispatched using this class.

Chunked array creation is also handled by this class. As chunked array objects have a one-to-one correspondence with
in-memory numpy arrays, it should be possible to create a chunked array from a numpy array by passing the desired
chunking pattern to an implementation of :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint.from_array``.

.. note::

The :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint` abstract base class is mostly just acting as a
namespace for containing the chunked-aware function primitives. Ideally in the future we would have an API standard
for chunked array types which codified this structure, making the entrypoint system unnecessary.

.. currentmodule:: xarray.core.parallelcompat

.. autoclass:: xarray.core.parallelcompat.ChunkManagerEntrypoint
:members:

Registering a new ChunkManagerEntrypoint subclass
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rather than hard-coding various chunk managers to deal with specific chunked array implementations, xarray uses an
entrypoint system to allow developers of new chunked array implementations to register their corresponding subclass of
:py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint`.


To register a new entrypoint you need to add an entry to the ``setup.cfg`` like this::

[options.entry_points]
xarray.chunkmanagers =
dask = xarray.core.daskmanager:DaskManager

See also `cubed-xarray <https://github.com/xarray-contrib/cubed-xarray>`_ for another example.

To check that the entrypoint has worked correctly, you may find it useful to display the available chunkmanagers using
the internal function :py:func:`~xarray.core.parallelcompat.list_chunkmanagers`.

.. autofunction:: list_chunkmanagers


User interface
~~~~~~~~~~~~~~

Once the chunkmanager subclass has been registered, xarray objects wrapping the desired array type can be created in 3 ways:

#. By manually passing the array type to the :py:class:`DataArray` constructor, see the examples for :ref:`numpy-like arrays <userguide.duckarrays>`,

#. Calling :py:meth:`DataArray.chunk`, passing the keyword arguments ``chunked_array_type`` and ``from_array_kwargs``,

#. Calling :py:func:`open_dataset`, passing the keyword arguments ``chunked_array_type`` and ``from_array_kwargs``.

The latter two methods ultimately call the chunkmanager's implementation of ``.from_array``, to which they pass the ``from_array_kwargs`` dict.
The ``chunked_array_type`` kwarg selects which registered chunkmanager subclass to dispatch to. It defaults to ``'dask'`` if found,
otherwise to whichever chunkmanager is registered if only one is registered, else it will raise an error.

Parallel processing without chunks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To use a parallel array type that does not expose a concept of chunks explicitly, none of the information on this page
is theoretically required. Such an array type (e.g. `Ramba <https://github.com/Python-for-HPC/ramba>`_ or
`Arkouda <https://github.com/Bears-R-Us/arkouda>`_) could be wrapped using xarray's existing support for
:ref:`numpy-like "duck" arrays <userguide.duckarrays>`.
2 changes: 2 additions & 0 deletions doc/internals/duck-arrays-integration.rst
Expand Up @@ -11,6 +11,8 @@ Integrating with duck arrays
Xarray can wrap custom numpy-like arrays (":term:`duck array`\s") - see the :ref:`user guide documentation <userguide.duckarrays>`.
This page is intended for developers who are interested in wrapping a new custom array type with xarray.

.. _internals.duckarrays.requirements:

Duck array requirements
~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
1 change: 1 addition & 0 deletions doc/internals/index.rst
Expand Up @@ -21,6 +21,7 @@ The pages in this section are intended for:

variable-objects
duck-arrays-integration
chunked-arrays
extending-xarray
zarr-encoding-spec
how-to-add-new-backend
2 changes: 1 addition & 1 deletion doc/user-guide/duckarrays.rst
Expand Up @@ -27,7 +27,7 @@ Some numpy-like array types that xarray already has some support for:

For information on wrapping dask arrays see :ref:`dask`. Whilst xarray wraps dask arrays in a similar way to that
described on this page, chunked array types like :py:class:`dask.array.Array` implement additional methods that require
slightly different user code (e.g. calling ``.chunk`` or ``.compute``).
slightly different user code (e.g. calling ``.chunk`` or ``.compute``). See the docs on :ref:`wrapping chunked arrays <internals.chunkedarrays>`.

Why "duck"?
-----------
Expand Down
2 changes: 2 additions & 0 deletions doc/whats-new.rst
Expand Up @@ -38,6 +38,8 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

- Added page on wrapping chunked numpy-like arrays as alternatives to dask arrays.
(:pull:`7951`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Expanded the page on wrapping numpy-like "duck" arrays.
(:pull:`7911`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Added examples to docstrings of :py:meth:`Dataset.isel`, :py:meth:`Dataset.reduce`, :py:meth:`Dataset.argmin`,
Expand Down
2 changes: 1 addition & 1 deletion xarray/core/dataset.py
Expand Up @@ -8648,7 +8648,7 @@ def argmax(self: T_Dataset, dim: Hashable | None = None, **kwargs) -> T_Dataset:
... )

# Indices of the maximum values along the 'student' dimension are calculated

>>> argmax_indices = dataset.argmax(dim="test")

>>> argmax_indices
Expand Down