
[REVIEW] Support of alternative array classes #934

Merged
merged 40 commits into from
Sep 8, 2022

Conversation

@madsbk (Contributor) commented Jan 11, 2022

This PR implements support for alternative array types (other than NumPy arrays) by introducing a new argument, meta_array, that specifies the type/class of the underlying array.

The meta_array argument can be any class instance that can be used as the like argument in NumPy (see NEP 35).
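The dispatch idea behind meta_array can be illustrated without NumPy or zarr. In this sketch, HostArray, DeviceArray, and empty are hypothetical stand-ins (not part of any real API) that mimic how an empty instance acts purely as a type marker, the way NEP 35's like= argument does:

```python
class HostArray:
    """Hypothetical stand-in for numpy.ndarray."""
    def __init__(self, shape=()):
        self.shape = shape

class DeviceArray(HostArray):
    """Hypothetical stand-in for cupy.ndarray."""

def empty(shape, *, like):
    # NEP-35 style dispatch: allocate an array of the same class as `like`.
    return type(like)(shape)

# An empty instance carries no data; it only selects the array class,
# which is exactly the role meta_array plays in this PR.
meta = DeviceArray()
chunk = empty((10,), like=meta)
print(type(chunk).__name__)  # DeviceArray
```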

Also, this PR implements support of CuPy arrays by introducing a CPU compressor:

class CuPyCPUCompressor(Codec):
    """CPU compressor for CuPy arrays

    This compressor moves CuPy arrays to host memory before compressing
    the arrays using `compressor`.

    Parameters
    ----------
    compressor : numcodecs.abc.Codec, optional
        The codec to use for compression and decompression.
    """

    def __init__(self, compressor=None):
        self.compressor = compressor

    def encode(self, buf):
        buf = cupy.asnumpy(buf)  # device-to-host copy
        return self.compressor.encode(buf) if self.compressor else buf

    def decode(self, chunk, out=None):
        chunk = self.compressor.decode(chunk) if self.compressor else chunk
        return cupy.asarray(chunk)  # host-to-device copy (sketch; `out` handling omitted)

Example

A common limitation in GPU programming is available GPU memory. In the following, we create a Zarr array of CuPy arrays that are saved to host memory using the Zlib codec for compression/decompression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

If we don't want compression but still want to copy to/from host memory, we can do the following:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor())

If the store handles CuPy arrays directly, we can disable compression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=None)
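The copy-to-host-then-delegate pattern behind CuPyCPUCompressor can be exercised without CuPy or zarr. FakeDeviceBuffer, ZlibCodec, and HostCopyCodec below are hypothetical stand-ins that only illustrate the wrapping idea, using the standard library's zlib in place of numcodecs:

```python
import zlib

class FakeDeviceBuffer:
    """Hypothetical stand-in for a CuPy array whose bytes live on a device."""
    def __init__(self, data: bytes):
        self._device_data = data
    def to_host(self) -> bytes:
        # Analogue of cupy.asnumpy(): copy device memory to host memory.
        return self._device_data

class ZlibCodec:
    """Minimal CPU codec, standing in for numcodecs.Zlib."""
    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf)
    def decode(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)

class HostCopyCodec:
    """Copy to host memory, then delegate to an optional CPU codec."""
    def __init__(self, compressor=None):
        self.compressor = compressor
    def encode(self, buf):
        host = buf.to_host() if isinstance(buf, FakeDeviceBuffer) else buf
        return self.compressor.encode(host) if self.compressor else host
    def decode(self, buf):
        return self.compressor.decode(buf) if self.compressor else buf

codec = HostCopyCodec(ZlibCodec())
payload = FakeDeviceBuffer(b"gpu" * 100)
assert codec.decode(codec.encode(payload)) == b"gpu" * 100
```

As in the examples above, passing no inner codec (HostCopyCodec()) gives the copy-only behavior, and omitting the wrapper entirely corresponds to compressor=None.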

Notice

  • This feature becomes more relevant when we have stores that support direct read/write of GPU memory. E.g. a store that uses
    GPUDirect Storage (GDS) would be able to bypass host memory when copying GPU memory to/from disk.
  • This PR implements a new version of ensure_ndarray() and ensure_contiguous_ndarray() that accepts CuPy arrays as well as NumPy arrays. Instead, we should properly implement more generic versions of them in numcodecs (see WIP: Allow CuPy numcodecs#212).
  • This PR depends on NumPy version 1.20 since it uses NEP 35. Is this an issue?

TODO

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

cc. @jakirkham

@pep8speaks commented Jan 11, 2022

Hello @madsbk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-09-05 12:30:20 UTC

@codecov bot commented Jan 11, 2022

Codecov Report

Merging #934 (3ed7a7e) into main (f6698f6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main     #934    +/-   ##
========================================
  Coverage   99.94%   99.95%            
========================================
  Files          35       36     +1     
  Lines       13965    14084   +119     
========================================
+ Hits        13958    14077   +119     
  Misses          7        7            
Impacted Files                  Coverage              Δ
zarr/creation.py                100.00% <ø>           (ø)
zarr/core.py                    100.00% <100.00%>     (ø)
zarr/hierarchy.py               99.79%  <100.00%>     (+<0.01%) ⬆️
zarr/storage.py                 100.00% <100.00%>     (ø)
zarr/tests/test_meta_array.py   100.00% <100.00%>     (ø)
zarr/util.py                    100.00% <100.00%>     (ø)

@madsbk madsbk marked this pull request as ready for review January 12, 2022 07:36
@madsbk madsbk changed the title Support of alternative array classes [REVIEW] Support of alternative array classes Jan 12, 2022
@jakirkham (Member) left a comment

Thanks Mads! 😄

Have a few initial comments below.

Still thinking through if there are ways we can improve the meta_array interface from the end user perspective. Curious to get your thoughts on this as well 🙂

@@ -1772,8 +1796,9 @@ def _process_chunk(
self._dtype != object):

dest = out[out_selection]
dest_is_writable = getattr(dest, "writeable", True)
@jakirkham (Member):

Wouldn't we need to check flags for writeable?

@madsbk (Contributor, Author):

CuPy arrays don't have a writeable flag: cupy/cupy#2616
Added

# Assume that array-like objects that don't have a
# `writeable` flag are writable.
dest_is_writable = getattr(dest, "writeable", True)
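The getattr fallback above can be exercised without CuPy. NoFlagsArray and ReadOnlyArray below are hypothetical stand-ins: the first mimics an array class, such as CuPy's, that lacks the attribute entirely, the second an array that explicitly reports itself as read-only:

```python
class NoFlagsArray:
    """Hypothetical array-like without a `writeable` attribute."""

class ReadOnlyArray:
    """Hypothetical array-like that reports itself as read-only."""
    writeable = False

# Arrays without the flag are assumed writable; those with it are trusted.
assert getattr(NoFlagsArray(), "writeable", True) is True
assert getattr(ReadOnlyArray(), "writeable", True) is False
```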

zarr/core.py
if not is_scalar(value, self._dtype):
value = np.asanyarray(value)
try:
value = ensure_ndarray(value)
@jakirkham (Member):

What happens if value is a bytes object both before and after this change?

@madsbk (Contributor, Author):

Before, bytearrays were converted to uint8 arrays and bytes were converted to |S arrays.
Now, both bytes and bytearrays are converted to a NumPy array of type uint8.

I don't think this makes any difference in practice.
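That bytes and bytearray already look identical at the buffer level can be checked with the standard library alone, which supports treating both uniformly as uint8:

```python
# Both types expose format "B" (unsigned char) via the buffer protocol,
# so a uniform uint8 interpretation loses no information.
assert memoryview(b"abc").format == "B"
assert memoryview(bytearray(b"abc")).format == "B"
assert list(memoryview(b"abc")) == [97, 98, 99]
```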

zarr/util.py, lines 72 to 73:
# TODO: move the function to numcodecs
def ensure_contiguous_ndarray(buf, max_buffer_size=None):
@jakirkham (Member):

Just flagging this so we remember to move to Numcodecs

Curious to know what changes were needed here 🙂

@madsbk (Contributor, Author):

It just needs to use the modified ensure_ndarray.

@@ -155,6 +161,7 @@ def __init__(
cache_attrs=True,
partial_decompress=False,
write_empty_chunks=True,
meta_array=None,
@jakirkham (Member):

I wonder if this could be picked up from the Store itself. That way a Store could specify its return type (NumPy or otherwise). This would save the user from getting involved as much in the process and hopefully make it easier for them to get started. WDYT?

@grlee77 would be good to get your thoughts as well in light of the BaseStore work 🙂

@madsbk (Contributor, Author):

I think it would be a good idea to have Store define a default meta_array but in the general case, I think a Store should be able to handle both NumPy and other types like CuPy arrays simultaneously.

@jakirkham (Member)

cc @d-v-b @joshmoore (as we discussed this last week it would be good to get your takes as well 🙂)

@jakirkham (Member)

cc @d70-t ( in case you have thoughts here as well 🙂 )

@joshmoore (Member)

Knowing little about the requirements here, just looking at the example:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

I wonder a couple of things:

  • Could this be done at a higher level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to avoid passing every flag down to arrays.)
  • Similarly, I wonder if a compressor wrapper is the best strategy. It may well be, but it would be good to have an idea of other related use cases. Are there any inspection methods on codecs to inform the wrappers of what they need to know? What other wrappers could we imagine? And as always, could we make it work without the user needing to remember to add the wrapper? (perhaps by making use of @data-apis style methods?)

madsbk and others added 4 commits January 21, 2022 11:30
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
@d70-t (Contributor) left a comment

I'm with @joshmoore, that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

I'm wondering if we should make more of a distinction between

  • the compressor specification (the goal: how the bytes should look in the store) and
  • the compressor implementation (the path: how those bytes are actually produced and read)

When thinking of what I'd want to express in this line, I'd say

I want to put my array a onto a store (which is implicitly dict in this case), in chunks of 10 and it should be zlib-compressed.

I don't really care about how the data moves from a to the store (if a is a numpy array, I'd probably like to use numcodecs directly, if a is a cupy array, I'd like to use GDS and a GPU compressor if available and like to fall back to numcodecs if not). In that sense, the line should maybe read like z = zarr.array(a, chunks=10, compressor=Zlib()).

Afterwards, I'd probably want to get a selection back from the zarr-array z, and I might want that to live on the GPU (which might be a different choice from what a was...). This creates a bit of a problem. I might want to write something like:

selection: cupy.ndarray = z[3:7]

... but Python doesn't (yet?) work this way, so we need to pass the type-information differently into z. It might be

selection = z.get(slice(3,7), meta_array=cupy.empty(()))

which is far less beautiful than

selection = z[3:7]

For the last option to work, there must be a way to change the default return type of the []-operator, so passing the meta_array into zarr.array additionally might be a viable option. Still, something like the get variant might be useful for some, if the array is used on both CPU and GPU.

In any case, up to now, the CuPyCPUCompressor (which I consider a description of the path between array and store) would not be visible in any place, and it might also be different for different set/get calls of the same zarr array z. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarray metadata would be a bad thing. In particular, I'd argue that cupy_cpu_compressor should maybe not appear in the .zarray metadata.

@madsbk (Contributor) commented Jan 21, 2022

  • Could this be done higher-level? As @jakirkham mentions above, could array initialization get this from the store. Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to not need to pass every flag down to arrays)

Agree, it would be a good idea to have Store define a default meta_array.

I'm with @joshmoore, that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively.
The main goal of this PR is to make it possible to write GPU-aware stores without having Zarr force-cast everything to NumPy arrays.

When thinking of what I'd want to express in this line, I'd say

I want to put my array a onto a store (which is implicitly dict in this case), in chunks of 10 and it should be zlib-compressed.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.
Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.
I think this is a great idea and I think this PR actually gets us some of the way.

@d70-t (Contributor) commented Jan 21, 2022

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively.

True, a compressor-wrapper seems to be a relatively lightweight way to implement such GPU-aware stores. But I'm wondering if that's really the API we'll want to end up with. We'll probably also want such compressor-wrappers as a simple way to build more compressor implementations, but maybe we don't want to spell them out at this function call. That's what I meant by separating compressor implementations from compressor specifications. We likely want to create zlib-compressed data on the store (the specification), but might want to use the CuPyCPU(Zlib) implementation for the transformation at the very moment of writing the data.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.

Do we need that information when creating the array from scratch? The array will probably only contain the metadata keys, and those should be independent of the compressor implementation.

Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.

Yes, that's about what I had in mind. It would be great to have arrays of any kind on one side and stores and compression specifications on the other side and some flexible, auto-optimizing routing in between. I believe that the specific algorithm of how to compress / decompress and move the data in between should neither belong to the store nor to the array-type.

In particular, it seems to be a totally valid approach to use e.g. a CPU to write data to some (e.g. S3-) store and read data back from the same store later on via GPU.
If one might want to go totally crazy, one might even write some parts of an array using CPU and other parts using GPU (I know of some people in C++ communities who work on parallelizing loops such that parts of the same loop run on available CPUs and parts run on available GPUs in order to get the most out of a system).

@d70-t (Contributor) commented Jan 21, 2022

I just stumbled again across the mentioned entrypoints... It's probably worth a thought whether something roughly like:

compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": Zlib, "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyCPUCompressor(Zlib), "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyGDSZlib, "priority": 1},
    ...
]

could be established via entrypoints. That list would depend on which libraries are present on the system and would help find the right compressor implementation for a given combination of array and codec... Still, I haven't thought yet about how that might interfere with filters, which may also play a role...
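Such a registry could be queried with a small priority-based lookup. The sketch below is purely hypothetical: compressor names are plain strings rather than codec objects, and it assumes lower numbers mean higher priority (so the GDS entry wins over the CPU fallback):

```python
compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": "Zlib", "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyCPUCompressor(Zlib)", "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyGDSZlib", "priority": 1},
]

def find_compressor(array: str, algorithm: str) -> str:
    """Return the best-priority implementation for (array, algorithm)."""
    matches = [c for c in compressors
               if c["array"] == array and c["algorithm"] == algorithm]
    if not matches:
        raise LookupError(f"no compressor for {array}/{algorithm}")
    return min(matches, key=lambda c: c["priority"])["compressor"]

print(find_compressor("cupy", "zlib"))   # CuPyGDSZlib
print(find_compressor("numpy", "zlib"))  # Zlib
```

The specification ("zlib") stays in the metadata while the implementation is resolved at read/write time from whatever is installed, matching the specification/implementation split discussed above.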

@madsbk (Contributor) commented Jan 24, 2022

@d70-t I agree with your vision but I think we should move in small steps.

I think we all can agree that we need some way to specify the desired array-like type of a getitem operation. For backward compatibility and for convenience, I suggest that we specify this through the meta_array argument on zarr.Array creation. But this is only the default array-like type; I agree with @d70-t that it should be possible to override the default type using something like:

selection = z.get(slice(3,7), meta_array=cupy.empty(())) 

And as @joshmoore suggests, we might even make it configurable at a higher level (e.g. through a config file and/or context manager).

However, I think this is outside of the scope of this PR. The goal here is to avoid force-casting all data to NumPy arrays.

In any case, up to now, the CuPyCPUCompressor (which I consider a description of the path between array and store) would not be visible in any place, and it might also be different for different set/get calls of the same zarr array z. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarray metadata would be a bad thing. In particular, I'd argue that cupy_cpu_compressor should maybe not appear in the .zarray metadata.

Again, I agree. I can move the implementation of CuPyCPUCompressor to test_cupy.py to make it clear that it is just a way to use CuPy arrays with the current API and is not part of the official API.

@d70-t (Contributor) commented Jan 24, 2022

👍 for moving in small steps. And I also agree with the rest of the post, @madsbk.

The only vision-wise thing which I believe should be accounted for right now is that indications of the (in-memory) array type should not be part of the zarr metadata (and I'm still uncertain whether it would be OK to do so as part of a test; probably that's OK). But that's more a personal feeling; it should maybe be up for discussion, and others might have different opinions.

raydouglass pushed a commit to rapidsai/kvikio that referenced this pull request Jan 28, 2022
@madsbk (Contributor) commented Aug 2, 2022

This is ready for another round of reviews

cc. @jakirkham, @d70-t

@joshmoore (Member)

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

@madsbk (Contributor) commented Aug 8, 2022

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

@joshmoore (Member)

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

@madsbk (Contributor) commented Aug 11, 2022

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

You are absolutely right! No idea how I could miss that :)

Adding tests of meta_array when CuPy isn't available.

@madsbk (Contributor) commented Sep 5, 2022

Hi @joshmoore, is there anything you need from my side to make progress here?

@joshmoore (Member)

Hi @madsbk. No, sorry. This just landed during vacation ⛰️. I'll start reviewing for a 2.13 release ASAP.

@joshmoore (Member)

Rolling into a 2.13 pre-release. If anyone has any lingering concerns about the API, there would still be a (small) window to address them in.

@joshmoore joshmoore merged commit 43266ee into zarr-developers:main Sep 8, 2022
@joshmoore joshmoore mentioned this pull request Sep 8, 2022
madsbk added a commit to madsbk/kvikio that referenced this pull request Sep 9, 2022
rapids-bot bot pushed a commit to rapidsai/kvikio that referenced this pull request Sep 12, 2022
which include zarr-developers/zarr-python#934

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #129
@jakirkham (Member)

Thanks Mads for working on this and Josh for reviewing! 🙏

7 participants