
[REVIEW] Support of alternative array classes #934

Merged
merged 40 commits into from
Sep 8, 2022

Conversation

@madsbk (Contributor) commented Jan 11, 2022

This PR implements support for alternative array types (other than NumPy arrays) by introducing a new argument, meta_array, that specifies the type/class of the underlying array.

The meta_array argument can be any class instance that can be used as the like argument in NumPy (see NEP 35).
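The dispatch idea behind meta_array can be illustrated without NumPy or zarr. In this sketch, HostArray, DeviceArray, and empty are hypothetical stand-ins (not part of any real API) that mimic how an empty instance acts purely as a type marker, the way NEP 35's like= argument does:

```python
class HostArray:
    """Hypothetical stand-in for numpy.ndarray."""
    def __init__(self, shape=()):
        self.shape = shape

class DeviceArray(HostArray):
    """Hypothetical stand-in for cupy.ndarray."""

def empty(shape, *, like):
    # NEP-35 style dispatch: allocate an array of the same class as `like`.
    return type(like)(shape)

# An empty instance carries no data; it only selects the array class,
# which is exactly the role meta_array plays in this PR.
meta = DeviceArray()
chunk = empty((10,), like=meta)
print(type(chunk).__name__)  # DeviceArray
```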

Also, this PR implements support of CuPy arrays by introducing a CPU compressor:

class CuPyCPUCompressor(Codec):
    """CPU compressor for CuPy arrays

    This compressor moves CuPy arrays to host memory before compressing
    the arrays using `compressor`.

    Parameters
    ----------
    compressor : numcodecs.abc.Codec, optional
        The codec to use for compression and decompression.
    """

    def __init__(self, compressor=None):
        self.compressor = compressor

    def encode(self, buf):
        buf = cupy.asnumpy(buf)  # device-to-host copy
        return self.compressor.encode(buf) if self.compressor else buf

    def decode(self, chunk, out=None):
        chunk = self.compressor.decode(chunk) if self.compressor else chunk
        return cupy.asarray(chunk)  # host-to-device copy (sketch; `out` handling omitted)

Example

A common limitation in GPU programming is available GPU memory. In the following, we create a Zarr array of CuPy arrays that are saved to host memory using the Zlib codec for compression/decompression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

If we don't want compression but still want to copy to/from host memory, we can do the following:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor())

If the store handles CuPy arrays directly, we can disable compression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=None)
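The copy-to-host-then-delegate pattern behind CuPyCPUCompressor can be exercised without CuPy or zarr. FakeDeviceBuffer, ZlibCodec, and HostCopyCodec below are hypothetical stand-ins that only illustrate the wrapping idea, using the standard library's zlib in place of numcodecs:

```python
import zlib

class FakeDeviceBuffer:
    """Hypothetical stand-in for a CuPy array whose bytes live on a device."""
    def __init__(self, data: bytes):
        self._device_data = data
    def to_host(self) -> bytes:
        # Analogue of cupy.asnumpy(): copy device memory to host memory.
        return self._device_data

class ZlibCodec:
    """Minimal CPU codec, standing in for numcodecs.Zlib."""
    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf)
    def decode(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)

class HostCopyCodec:
    """Copy to host memory, then delegate to an optional CPU codec."""
    def __init__(self, compressor=None):
        self.compressor = compressor
    def encode(self, buf):
        host = buf.to_host() if isinstance(buf, FakeDeviceBuffer) else buf
        return self.compressor.encode(host) if self.compressor else host
    def decode(self, buf):
        return self.compressor.decode(buf) if self.compressor else buf

codec = HostCopyCodec(ZlibCodec())
payload = FakeDeviceBuffer(b"gpu" * 100)
assert codec.decode(codec.encode(payload)) == b"gpu" * 100
```

As in the examples above, passing no inner codec (HostCopyCodec()) gives the copy-only behavior, and omitting the wrapper entirely corresponds to compressor=None.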

Notice

  • This feature becomes more relevant when we have stores that support direct read/write of GPU memory. E.g. a store that uses
    GPUDirect Storage (GDS) would be able to bypass host memory when copying GPU memory to/from disk.
  • This PR implements a new version of ensure_ndarray() and ensure_contiguous_ndarray() that accepts CuPy arrays as well as NumPy arrays. Instead, we should properly implement more generic versions of them in numcodecs (see WIP: Allow CuPy numcodecs#212).
  • This PR depends on NumPy version 1.20 since it uses NEP 35. Is this an issue?

TODO

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

cc. @jakirkham

@pep8speaks commented Jan 11, 2022

Hello @madsbk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-09-05 12:30:20 UTC

@codecov bot commented Jan 11, 2022

Codecov Report

Merging #934 (3ed7a7e) into main (f6698f6) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main     #934    +/-   ##
========================================
  Coverage   99.94%   99.95%            
========================================
  Files          35       36     +1     
  Lines       13965    14084   +119     
========================================
+ Hits        13958    14077   +119     
  Misses          7        7            
Impacted Files                  Coverage              Δ
zarr/creation.py                100.00% <ø>           (ø)
zarr/core.py                    100.00% <100.00%>     (ø)
zarr/hierarchy.py               99.79%  <100.00%>     (+<0.01%) ⬆️
zarr/storage.py                 100.00% <100.00%>     (ø)
zarr/tests/test_meta_array.py   100.00% <100.00%>     (ø)
zarr/util.py                    100.00% <100.00%>     (ø)

@madsbk madsbk marked this pull request as ready for review January 12, 2022 07:36
@madsbk madsbk changed the title Support of alternative array classes [REVIEW] Support of alternative array classes Jan 12, 2022
@jakirkham (Member) left a comment

Thanks Mads! 😄

Have a few initial comments below.

Still thinking through if there are ways we can improve the meta_array interface from the end user perspective. Curious to get your thoughts on this as well 🙂

@@ -1772,8 +1796,9 @@ def _process_chunk(
self._dtype != object):

dest = out[out_selection]
dest_is_writable = getattr(dest, "writeable", True)
@jakirkham (Member):

Wouldn't we need to check flags for writeable?

@madsbk (Contributor, Author):

CuPy arrays don't have a writeable flag: cupy/cupy#2616
Added

# Assume that array-like objects that don't have a
# `writeable` flag are writable.
dest_is_writable = getattr(dest, "writeable", True)
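The getattr fallback above can be exercised without CuPy. NoFlagsArray and ReadOnlyArray below are hypothetical stand-ins: the first mimics an array class, such as CuPy's, that lacks the attribute entirely, the second an array that explicitly reports itself as read-only:

```python
class NoFlagsArray:
    """Hypothetical array-like without a `writeable` attribute."""

class ReadOnlyArray:
    """Hypothetical array-like that reports itself as read-only."""
    writeable = False

# Arrays without the flag are assumed writable; those with it are trusted.
assert getattr(NoFlagsArray(), "writeable", True) is True
assert getattr(ReadOnlyArray(), "writeable", True) is False
```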

zarr/core.py
if not is_scalar(value, self._dtype):
value = np.asanyarray(value)
try:
value = ensure_ndarray(value)
@jakirkham (Member):

What happens if value is a bytes object both before and after this change?

@madsbk (Contributor, Author):

Before, bytearrays were converted to uint8 arrays and bytes were converted to |S arrays.
Now, both bytes and bytearrays are converted to a NumPy array of type uint8.

I don't think this makes any difference in practice.
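That bytes and bytearray already look identical at the buffer level can be checked with the standard library alone, which supports treating both uniformly as uint8:

```python
# Both types expose format "B" (unsigned char) via the buffer protocol,
# so a uniform uint8 interpretation loses no information.
assert memoryview(b"abc").format == "B"
assert memoryview(bytearray(b"abc")).format == "B"
assert list(memoryview(b"abc")) == [97, 98, 99]
```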

zarr/util.py, lines 72 to 73:
# TODO: move the function to numcodecs
def ensure_contiguous_ndarray(buf, max_buffer_size=None):
@jakirkham (Member):

Just flagging this so we remember to move to Numcodecs

Curious to know what changes were needed here 🙂

@madsbk (Contributor, Author):

It just needs to use the modified ensure_ndarray.

@@ -155,6 +161,7 @@ def __init__(
cache_attrs=True,
partial_decompress=False,
write_empty_chunks=True,
meta_array=None,
@jakirkham (Member):

I wonder if this could be picked up from the Store itself. That way a Store could specify its return type (NumPy or otherwise). This would save the user from getting involved as much in the process and hopefully make it easier for them to get started. WDYT?

@grlee77 would be good to get your thoughts as well in light of the BaseStore work 🙂

@madsbk (Contributor, Author):

I think it would be a good idea to have Store define a default meta_array but in the general case, I think a Store should be able to handle both NumPy and other types like CuPy arrays simultaneously.

@jakirkham (Member)

cc @d-v-b @joshmoore (as we discussed this last week it would be good to get your takes as well 🙂)

@jakirkham (Member)

cc @d70-t ( in case you have thoughts here as well 🙂 )

@joshmoore (Member)

Knowing little about the requirements here, just looking at the example:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

I wonder a couple of things:

  • Could this be done at a higher level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to avoid passing every flag down to arrays.)
  • Similarly, I wonder if a compressor wrapper is the best strategy. It may well be, but it would be good to have an idea of other related use cases. Are there any inspection methods on codecs to inform the wrappers of what they need to know? What other wrappers could we imagine? And as always, could we make it work without the user needing to remember to add the wrapper? (perhaps by making use of @data-apis style methods?)

madsbk and others added 4 commits January 21, 2022 11:30
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
@d70-t (Contributor) left a comment

I'm with @joshmoore, that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

I'm wondering if we should make more of a distinction between

  • the compressor specification (the goal: how the bytes should look in the store) and
  • the compressor implementation (the path: how those bytes are actually produced and read)

When thinking of what I'd want to express in this line, I'd say

I want to put my array a onto a store (which is implicitly dict in this case), in chunks of 10 and it should be zlib-compressed.

I don't really care about how the data moves from a to the store (if a is a numpy array, I'd probably like to use numcodecs directly, if a is a cupy array, I'd like to use GDS and a GPU compressor if available and like to fall back to numcodecs if not). In that sense, the line should maybe read like z = zarr.array(a, chunks=10, compressor=Zlib()).

Afterwards, I'd probably want to get a selection back from the zarr-array z, and I might want that to live on the GPU (which might be a different choice from what a was...). This creates a bit of a problem. I might want to write something like:

selection: cupy.ndarray = z[3:7]

... but Python doesn't (yet?) work this way, so we need to pass the type-information differently into z. It might be

selection = z.get(slice(3,7), meta_array=cupy.empty(()))

which is far less beautiful than

selection = z[3:7]

For the last option to work, there must be a way to change the default return type of the []-operator, so passing the meta_array into zarr.array additionally might be a viable option. Still, something like the get variant might be useful for some, if the array is used on both CPU and GPU.

In any case, up to now, the CuPyCPUCompressor (which I consider a description of the path between array and store) would not be visible in any place, and it might also be different for different set/get calls of the same zarr array z. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarray metadata would be a bad thing. In particular, I'd argue that cupy_cpu_compressor should maybe not appear in the .zarray metadata.

@madsbk (Contributor) commented Jan 21, 2022

  • Could this be done higher-level? As @jakirkham mentions above, could array initialization get this from the store. Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to not need to pass every flag down to arrays)

Agree, it would be a good idea to have Store define a default meta_array.

I'm with @joshmoore, that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively.
The main goal of this PR is to make it possible to write GPU-aware stores without having Zarr force-cast everything to NumPy arrays.

When thinking of what I'd want to express in this line, I'd say

I want to put my array a onto a store (which is implicitly dict in this case), in chunks of 10 and it should be zlib-compressed.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.
Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.
I think this is a great idea and I think this PR actually gets us some of the way.

@d70-t (Contributor) commented Jan 21, 2022

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively.

True, a compressor-wrapper seems to be a relatively lightweight way to implement such GPU-aware stores. But I'm wondering if that's really the API we'll want to end up with. We'll probably also want such compressor-wrappers as a simple way to build more compressor implementations, but maybe we don't want to spell them out at this function call. That's what I meant by separating compressor implementations from compressor specifications. We likely want to create zlib-compressed data on the store (the specification), but might want to use the CuPyCPU(Zlib) implementation for the transformation at the very moment of writing the data.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.

Do we need that information when creating the array from scratch? The array will probably only contain the metadata keys, and those should be independent of the compressor implementation.

Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.

Yes, that's about what I had in mind. It would be great to have arrays of any kind on one side and stores and compression specifications on the other side and some flexible, auto-optimizing routing in between. I believe that the specific algorithm of how to compress / decompress and move the data in between should neither belong to the store nor to the array-type.

In particular, it seems to be a totally valid approach to use e.g. a CPU to write data to some (e.g. S3-) store and read data back from the same store later on via GPU.
If one might want to go totally crazy, one might even write some parts of an array using CPU and other parts using GPU (I know of some people in C++ communities who work on parallelizing loops such that parts of the same loop run on available CPUs and parts run on available GPUs in order to get the most out of a system).

@d70-t (Contributor) commented Jan 21, 2022

I just stumbled again across the mentioned entrypoints... It's probably worth a thought whether something roughly like:

compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": Zlib, "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyCPUCompressor(Zlib), "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyGDSZlib, "priority": 1},
    ...
]

could be established via entrypoints. That list would depend on which libraries are present on the system and would help find the right compressor implementation for a given combination of array and codec... Still, I haven't thought yet about how that might interfere with filters, which may also play a role...
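Such a registry could be queried with a small priority-based lookup. The sketch below is purely hypothetical: compressor names are plain strings rather than codec objects, and it assumes lower numbers mean higher priority (so the GDS entry wins over the CPU fallback):

```python
compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": "Zlib", "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyCPUCompressor(Zlib)", "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyGDSZlib", "priority": 1},
]

def find_compressor(array: str, algorithm: str) -> str:
    """Return the best-priority implementation for (array, algorithm)."""
    matches = [c for c in compressors
               if c["array"] == array and c["algorithm"] == algorithm]
    if not matches:
        raise LookupError(f"no compressor for {array}/{algorithm}")
    return min(matches, key=lambda c: c["priority"])["compressor"]

print(find_compressor("cupy", "zlib"))   # CuPyGDSZlib
print(find_compressor("numpy", "zlib"))  # Zlib
```

The specification ("zlib") stays in the metadata while the implementation is resolved at read/write time from whatever is installed, matching the specification/implementation split discussed above.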

@madsbk (Contributor) commented Jan 24, 2022

@d70-t I agree with your vision but I think we should move in small steps.

I think we all can agree that we need some way to specify the desired array-like type of a getitem operation. For backward compatibility and for convenience, I suggest that we specify this through the meta_array argument on zarr.Array creation. But this is only the default array-like type; I agree with @d70-t that it should be possible to override the default type using something like:

selection = z.get(slice(3,7), meta_array=cupy.empty(())) 

And as @joshmoore suggests, we might even make it configurable at a higher level (e.g. through a config file and/or context manager).

However, I think this is outside of the scope of this PR. The goal here is to avoid force-casting all data to NumPy arrays.

In any case, up to now, the CuPyCPUCompressor (which I consider a description of the path between array and store) would not be visible in any place, and it might also be different for different set/get calls of the same zarr array z. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarray metadata would be a bad thing. In particular, I'd argue that cupy_cpu_compressor should maybe not appear in the .zarray metadata.

Again, I agree. I can move the implementation of CuPyCPUCompressor to test_cupy.py to make it clear that it is just a way to use CuPy arrays with the current API and is not part of the official API.

@d70-t (Contributor) commented Jan 24, 2022

👍 for moving in small steps. And I also agree with the rest of the post, @madsbk.

The only vision-wise thing which I believe should be accounted for right now is that indications of the (in-memory) array type should not be part of the zarr metadata (and I'm still uncertain whether it would be OK to do so as part of a test; probably that's OK). But that's more a personal feeling; it should maybe be up for discussion, and others might have different opinions.

raydouglass pushed a commit to rapidsai/kvikio that referenced this pull request Jan 28, 2022
@madsbk (Contributor) commented Aug 2, 2022

This is ready for another round of reviews

cc. @jakirkham, @d70-t

@joshmoore (Member)

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

@madsbk (Contributor) commented Aug 8, 2022

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

@joshmoore (Member)

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

@madsbk (Contributor) commented Aug 11, 2022

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

You are absolutely right! No idea how I could miss that :)

Adding tests of meta_array when CuPy isn't available.

@madsbk (Contributor) commented Sep 5, 2022

Hi @joshmoore, is there anything you need from my side to make progress here?

@joshmoore (Member)

Hi @madsbk. No, sorry. This just landed during vacation ⛰️. I'll start reviewing for a 2.13 release ASAP.

@joshmoore (Member)

Rolling into a 2.13 pre-release. If anyone has any lingering concerns about the API, there would still be a (small) window to address them in.

@joshmoore joshmoore merged commit 43266ee into zarr-developers:main Sep 8, 2022
@joshmoore joshmoore mentioned this pull request Sep 8, 2022
madsbk added a commit to madsbk/kvikio that referenced this pull request Sep 9, 2022
rapids-bot bot pushed a commit to rapidsai/kvikio that referenced this pull request Sep 12, 2022
which include zarr-developers/zarr-python#934

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #129
@jakirkham (Member)

Thanks Mads for working on this and Josh for reviewing! 🙏

7 participants