Passing dask array to zarr leads to a TypeError #962

Open · michael-sutherland opened this issue Feb 9, 2022 · 24 comments
@michael-sutherland commented Feb 9, 2022

Minimal, reproducible code sample, a copy-pastable example if possible

import numpy as np
import zarr
import os
import dask.array as da

from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

path = "test_dask_image.zarr"
os.mkdir(path)

mean_val = 10
rng = np.random.default_rng(0)
data = rng.poisson(mean_val, size=(50, 128, 128)).astype(np.uint8)
data = da.from_array(data)  #  <--- !!!THIS IS WHAT BREAKS THE EXAMPLE!!!

# write the image data
store = parse_url(path, mode="w").store
root = zarr.group(store=store)
write_image(image=data, group=root, chunks=(1, 128, 128), axes="zyx")
root.attrs["omero"] = {
    "channels": [{
        "color": "00FFFF",
        "window": {"start": 0, "end": 20},
        "label": "random",
        "active": True,
    }]
}

Problem description

I get an error when trying to write a dask array to a zarr file. I took the proposed example from ome/ome-zarr-py#121 and wrapped the numpy array in a dask array. It appears that the dask array astype method doesn't support the "order" parameter. The example works for me when using a plain numpy array. Here's the resulting traceback:

Traceback (most recent call last):
  File "/home/sutherland/git/xray_napari/write_zarr_with_dask_failure.py", line 20, in <module>
    write_image(image=data, group=root, chunks=(1, 128, 128), axes="zyx")
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/ome_zarr/writer.py", line 178, in write_image
    write_multiscale(image, group, chunks=chunks, fmt=fmt, axes=axes)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/ome_zarr/writer.py", line 105, in write_multiscale
    group.create_dataset(str(path), data=dataset, chunks=chunks)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 808, in create_dataset
    return self._write_op(self._create_dataset_nosync, name, **kwargs)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 661, in _write_op
    return f(*args, **kwargs)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/hierarchy.py", line 824, in _create_dataset_nosync
    a = array(data, store=self._store, path=path, chunk_store=self._chunk_store,
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/creation.py", line 377, in array
    z[...] = data
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1224, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1319, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1610, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1682, in _set_selection
    self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1871, in _chunk_setitems
    cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1871, in <listcomp>
    cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/zarr/core.py", line 1924, in _process_for_setitem
    chunk = value.astype(self._dtype, order=self._order, copy=False)
  File "/home/sutherland/anaconda3/envs/napari2/lib/python3.9/site-packages/dask/array/core.py", line 2086, in astype
    raise TypeError(
TypeError: astype does not take the following keyword arguments: ['order']

Version and installation information

  • Value of zarr.__version__: 2.10.3
  • Value of numcodecs.__version__: 0.9.1
  • Version of Python interpreter: 3.9.7
  • Operating system (Linux/Windows/Mac): RHEL 8
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): using pip into conda env


@joshmoore (Member)

Hi @michael-sutherland. Can you include which version of dask?

@michael-sutherland (Author)

dask: 2021.12.0

@joshmoore (Member)

Thanks for the info, @michael-sutherland. Looking at the stacktrace again, any objections to migrating this to github.com/ome/ome-zarr-py ?

cc: @sbesson @will-moore

@michael-sutherland (Author)

Oops! Sorry if I misidentified which library the problem was in. Would you like me to migrate the issue or would you prefer to do it?

@joshmoore (Member) commented Feb 11, 2022

No worries. Hmmm.... looks like I can't transfer automatically between orgs anyway. See ome/ome-zarr-py#169

@will-moore commented Feb 15, 2022

As mentioned on #962, the example can be simplified to remove ome_zarr and get the same exception:

import numpy as np
import zarr
from zarr.storage import FSStore
import os
import dask.array as da

path = "test_dask_image.zarr"
os.mkdir(path)

mean_val = 10
rng = np.random.default_rng(0)
data = rng.poisson(mean_val, size=(50, 128, 128)).astype(np.uint8)
data = da.from_array(data)  #  <--- !!!THIS IS WHAT BREAKS THE EXAMPLE!!!

# write the image data
store = FSStore(path, mode="w")
root = zarr.group(store=store)
root.create_dataset("0", data=data, chunks=(1, 128, 128))

@joshmoore (Member)

Updated @will-moore's example to define mean_val = 10.

@joshmoore (Member)

Ok. I assume this is related to @madsbk's #934 in the sense that typically dask arrays are assumed to wrap zarr arrays rather than vice versa.

cc: @jakirkham

@GenevieveBuckley

Hi @michael-sutherland, is there a reason you don't want to use the dask.array.to_zarr() function?

I imagine there might be some cases where it makes sense, and this example could have lost that context when it was simplified to make debugging easier.
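
For anyone who wants that route, a minimal sketch (the path and random data here just mirror the original report, so treat them as placeholders):

import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
data = da.from_array(rng.poisson(10, size=(50, 128, 128)).astype(np.uint8), chunks=(1, 128, 128))

# dask writes the store chunk by chunk itself, so zarr never receives a dask array
da.to_zarr(data, "test_dask_image.zarr", component="0", overwrite=True)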

@GenevieveBuckley

I poked around with this a little, and there seem to be two places where things go wrong: one in dask, and one in zarr.

  1. In dask/array/core.py, in the astype method of the Array class

This line:
https://github.com/dask/dask/blob/cfaf931072ae079313761d4d61e25956029abf6c/dask/array/core.py#L2237

If it were changed to this, I think we'd fix the dask part of the problem:

extra = set(kwargs) - {"casting", "copy", "order"}
  2. In zarr, in zarr/core.py(2189), _encode_chunk()
    There is an ensure_ndarray function, which checks whether the chunk can be written properly (this check happens in numcodecs). But because chunk here is a chunk of the dask array instead of a numpy array, numcodecs chokes on it: what it sees in the buffer is not a numpy array but something else, so it throws an error.
> /Users/genevieb/mambaforge/envs/dask-dev/lib/python3.9/site-packages/zarr/core.py(2189)_encode_chunk()
   2187
   2188         # check object encoding
-> 2189         if ensure_ndarray(chunk).dtype == object:
   2190             raise RuntimeError('cannot write object array without object codec')

Zarr could potentially work around this second issue by calling .compute() on the dask chunk in ensure_ndarray before passing it along. But there's some question about whether that's actually a sensible thing to do. Things would probably also get uglier in the zarr code if zarr now needs to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.

Josh says above that typically it is expected dask will wrap zarr, and not the other way around. Making changes to (2) above would be a bit in conflict with that expectation.
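
(To see just the dask half of this in isolation, a small sketch of the mismatch with dask 2021.12.0, i.e. before any fix: numpy's astype accepts the order keyword that zarr passes in _process_for_setitem, while dask's does not.)

import numpy as np
import dask.array as da

x = np.arange(8, dtype=np.uint8)
d = da.from_array(x, chunks=4)

x.astype(np.float32, order="C", copy=False)  # fine: numpy accepts the order kwarg
d.astype(np.float32, order="C", copy=False)  # raises on older dask:
# TypeError: astype does not take the following keyword arguments: ['order']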

@vietnguyengit commented Jul 26, 2022

Hi guys, Google brought me here. I recently faced the same error as well, and I was using the dask.array.to_zarr() method.


@GenevieveBuckley

On the dask side of things, I've opened an issue and a PR.

As discussed above, this won't completely fix the problem here.

@vietnguyengit

Thanks @GenevieveBuckley, hopefully this will solve the issue on my side.

@GenevieveBuckley

Yes, let us know!

@joshmoore (Member) commented Jul 27, 2022

@GenevieveBuckley #962 (comment), point 2: "There is an ensure_ndarray function, which checks to see if the chunk can be written properly (this check happens with numcodecs). But, because chunk here is a chunk of the dask array instead of a numpy array, the numcodecs library chokes on it."

cc: @madsbk just in case his recent work will or could handle this.

@madsbk (Contributor) commented Aug 2, 2022

Zarr could potentially work around this second issue by trying to .compute() the dask chunk in ensure_ndarray, before passing it along. But then there's some question about whether that's actually a sensible thing to do or not. Things would probably also become uglier in the zarr code, if now zarr needs to check what type of array it is handling, especially since numpy and dask are not the only two options for array types.

I agree, calling .compute() in ensure_ndarray is doable but I don't think it is desirable. We already support any data type that implements NDArrayLike so it should be possible to wrap a dask.array in a class that implements NDArrayLike and calls .compute() when accessed.
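
(A rough sketch of that wrapper idea; the class and its name are hypothetical, and the real NDArrayLike protocol expects more methods than shown here. The point is just that zarr would see plain array attributes and only trigger computation when data is actually requested.)

import numpy as np

class DaskChunkWrapper:
    """Hypothetical wrapper: looks array-like to zarr, computes lazily on access."""

    def __init__(self, dask_array):
        self._arr = dask_array

    @property
    def shape(self):
        return self._arr.shape

    @property
    def dtype(self):
        return self._arr.dtype

    def __getitem__(self, selection):
        # materialise only the requested region
        return np.asarray(self._arr[selection])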

@GenevieveBuckley

I'm +1 on redirecting everyone back to use the dask.array.to_zarr method instead of making additional changes here.

jsignell pushed a commit to dask/dask that referenced this issue Aug 4, 2022
…9317)

Allows the `order` kwarg to be passed in to the dask `astype` method without triggering an error.

Our friends over at zarr have found a small bug in dask. While the numpy `astype` method allows the user to use an `order` keyword argument ([docs here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html)), the corresponding dask `astype` method produces an error.

Why we should do this:
1. It's not always as obvious as telling the user to edit the line in their code that uses `astype`, since it's often used very indirectly. Here's one example: zarr-developers/zarr-python#962 (comment)
2.  It reflects better on dask not to have odd errors popping up, even if this PR won't completely solve the issue being discussed over at zarr zarr-developers/zarr-python#962

@joshmoore (Member)

@michael-sutherland, can you let us know if you've found a solution that works for you?

@michael-sutherland (Author)

I ended up moving to a custom solution using an H5 file with a similar structure. My application involved displaying a 2D mosaic that was being generated by a tool (in this case an X-ray microscope) in "real time". I don't know the area ahead of time, and I needed to be able to expand the arrays and update the pyramid as I went. It is all working now, although I'd prefer a more standard format if possible. Also, expanding "to the left and up" is painful and involves copying, which I might be able to work around in a zarr file structure. If I can get the time, I'd like to try porting it to zarr.

Sorry if you were only doing this for me. I think supporting dask and other numpy-like arrays is important, although a custom call to "compute()" isn't the answer since that is so dask-specific. Maybe wrapping the data in a call to np.array(), which will pull in data from h5py or dask or whatever lazy-loading system someone might be using, would be better? It also won't make a copy if it is already a numpy array (as far as I know).
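
(For reference on the copy question, a quick check outside the thread's original code: np.asarray is the no-copy variant for existing numpy arrays, whereas np.array copies by default, and both will pull a dask array into memory via __array__.)

import numpy as np
import dask.array as da

a = np.arange(4)
print(np.asarray(a) is a)   # True: no copy for an existing ndarray
print(np.array(a) is a)     # False: np.array copies by default

d = da.from_array(a, chunks=2)
print(type(np.asarray(d)))  # <class 'numpy.ndarray'>: dask computes via __array__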

@madsbk (Contributor) commented Aug 8, 2022

No worries @michael-sutherland, it is always good to discuss design decisions :)

I am not sure that seamlessly converting Dask arrays to a local ndarray is a good idea. The data of a Dask array is often distributed between multiple machines, so gathering all of it into a single local ndarray is very expensive in terms of memory usage and performance.
In most cases, I think it is more desirable to use something like DataFrame.map_partitions() to operate on each local Zarr array.
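
(For illustration, a minimal sketch of that "dask wraps zarr" direction using map_blocks, the array-side analogue of map_partitions; the file names are placeholders. Each chunk is read, transformed, and written in parallel without ever gathering the whole array on one machine.)

import dask.array as da

source = da.from_zarr("test_dask_image.zarr", component="0")  # lazy view over the zarr array
result = source.map_blocks(lambda block: block * 2)           # runs per chunk
result.to_zarr("result.zarr", overwrite=True)                 # workers write their own chunks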

@joshmoore changed the title from "Can't write dask array to zarr file" to "Passing dask array to zarr leads to a TypeError" on Nov 15, 2022
@joshmoore (Member)

cc: @mrocklin just to clarify that it wasn't just a lack of Dask-❤️ but the deeper question of to compute() or not to compute(), in which case the duck typing (.chunks, etc.) isn't enough anyway. There would additionally need to be some recognition of the delayed-ness.

@mrocklin (Contributor)

but the deeper question of to compute() or not to compute(),

Typically we handle this by using protocols like __array__. For example if Zarr were to call np.asarray on every chunk before inserting it into some store then that might help to ensure that things work well.
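
(A tiny sketch of what that protocol looks like; the Lazy class here is purely illustrative, not a real library type. Dask arrays implement __array__ in the same spirit, which is why np.asarray(chunk) materialises them.)

import numpy as np

class Lazy:
    """Illustrative container: np.asarray materialises it via __array__."""

    def __init__(self, data):
        self._data = data

    def __array__(self, dtype=None):
        out = np.asarray(self._data)
        return out if dtype is None else out.astype(dtype)

chunk = Lazy([1, 2, 3])
print(np.asarray(chunk))  # array([1, 2, 3])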

@mrocklin (Contributor)

Chatting live with @joshmoore . Some thoughts:

Maybe things work now?

@GenevieveBuckley did some work, maybe stuff works. We should test this (I don't know this, Josh is suggesting it, don't blame if he's wrong 🙂 )

Maybe call np.asarray on each sliced chunk?

If it doesn't work, make sure that the sliced chunk you're about to write is concrete and in memory (probably a numpy array, but maybe something else, like a cupy array or some other buffer thing). In dumb code, this probably looks like this:

for slice in ...:
    chunk = array[slice]
    chunk = np.asarray(chunk)
    write(chunk, path)

Maybe use Dask Array?

This is a bit circular, but Zarr could use dask array

def give_array_to_zarr(self, arr, path):
    if arr.shape == self.chunks:
        # do numpy thing
    else:
        # Dask
        x = da.from_array(arr)
        x.to_zarr(...)

We need to be a little careful to avoid an infinite zarr-dask-zarr loop, but I think the first if statement handles it.

There's also some concern of this code being split between two repositories. I generously offer that you all own this code in the future 🙂

7 participants