Can't use = to reassign into object arrays in a group #502

Open
hyanwong opened this issue Nov 8, 2019 · 4 comments

Comments

@hyanwong

hyanwong commented Nov 8, 2019

Minimal, reproducible code sample, a copy-pastable example if possible

import zarr
import numcodecs
import numpy as np
store = zarr.DirectoryStore('example.zarr')
g = zarr.group(store=store, overwrite=True)
d = g.create_dataset('foo', shape=0, chunks=10, dtype=np.int64)
d = g.create_dataset('bar', shape=0, chunks=10, dtype=object, object_codec=numcodecs.JSON())
g['foo'].append([1, 2, 3, 4])
g['bar'].append(["a", "b", "c", "d"])
b = np.array([1,0,0,1], dtype=bool)
g['foo'] = g['foo'][:][b]  # Works
g['bar'] = g['bar'][:][b]  # Fails, because object_codec not specified

Problem description

Can't use = to reassign into object arrays in a group. The error is ValueError: missing object_codec for object array; see https://stackoverflow.com/questions/58745967/how-to-cut-down-delete-a-zarr-array and the traceback below:

>>> g['bar'] = g['bar'][:][b]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/zarr/hierarchy.py", line 335, in __setitem__
    self.array(item, value, overwrite=True)
  File "/usr/local/lib/python3.7/site-packages/zarr/hierarchy.py", line 908, in array
    return self._write_op(self._array_nosync, name, data, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/zarr/hierarchy.py", line 628, in _write_op
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/zarr/hierarchy.py", line 915, in _array_nosync
    **kwargs)
  File "/usr/local/lib/python3.7/site-packages/zarr/creation.py", line 341, in array
    z = create(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/zarr/creation.py", line 120, in create
    chunk_store=chunk_store, filters=filters, object_codec=object_codec)
  File "/usr/local/lib/python3.7/site-packages/zarr/storage.py", line 323, in init_array
    object_codec=object_codec)
  File "/usr/local/lib/python3.7/site-packages/zarr/storage.py", line 378, in _init_array_metadata
    raise ValueError('missing object_codec for object array')
ValueError: missing object_codec for object array

Version and installation information

Please provide the following:

  • Value of zarr.__version__: '2.3.2'
  • Value of numcodecs.__version__: '0.6.3'
  • Version of Python interpreter: Python 3.7.2 (default, Feb 12 2019, 08:15:36)
  • Operating system: Mac
  • How Zarr was installed: pip3 install zarr
@alimanfoo
Member

Thanks @hyanwong, I get what's happening here now.

TLDR, here's how to get what (I think) you want to work:

import zarr
import numcodecs
import numpy as np
store = zarr.DirectoryStore('example.zarr')
g = zarr.group(store=store, overwrite=True)
g.create_dataset('bar', shape=0, chunks=10, dtype=object, object_codec=numcodecs.JSON())
g['bar'].append(["a", "b", "c", "d"])
b = np.array([1,0,0,1], dtype=bool)
new_bar = g['bar'][:][b]
g.create_dataset('bar', data=new_bar, chunks=10, dtype=object, object_codec=numcodecs.JSON(), overwrite=True)

Long explanation: if you have a group g and some numpy array x and you attempt an assignment like:

g['bar'] = x

...that is actually a shorthand for creating a new zarr array called "bar" as a member of the group g. If an array called "bar" happens to already exist, it will be deleted and a new array created with the new data in x. But when creating a new object array an object codec is needed, and you cannot provide an object codec via this shorthand syntax. Hence the explicit version works:

g.create_dataset("bar", data=x, dtype=object, chunks=10, object_codec=numcodecs.JSON())

@hyanwong
Author

hyanwong commented Nov 11, 2019

Ah, I thought it might be something like that (ability to assign via = not allowing parameters such as object_codec). I was wondering if there was any mileage in copying the parameters which were used when setting up the original array into the new array, if the array already exists? Of course, this might cause problems for people who assume that the = operator is agnostic to the existence or otherwise of an identically named array.

Either way, it would be useful to include something about this in the documentation (or maybe I missed it).

There is a more general question of whether there is an efficient way to reassign a boolean indexed version of the same array back into the original zarr data store, without (necessarily) having to read the entire array into memory. That's a slightly different question (although I'm not sure if it's something peculiar that I want to do, but which wouldn't be of general use). However, going down this route, I can imagine it being a useful addition to be able to copy a zarr mask selection into a new zarr array via the = operator, and in this case it would be possible to use the object_codec and other parameters from the passed-in masked selection.

@alimanfoo
Member

alimanfoo commented Nov 14, 2019

> Ah, I thought it might be something like that (ability to assign via = not allowing parameters such as object_codec). I was wondering if there was any mileage in copying the parameters which were used when setting up the original array into the new array, if the array already exists? Of course, this might cause problems for people who assume that the = operator is agnostic to the existence or otherwise of an identically named array.

Yeah, I think this might be asking too much of group item assignment (Group.__setitem__). I actually wonder if it should be removed altogether, because it is not always obvious what's happening. I.e., better to force people to use more explicit methods like Group.create_dataset().

> Either way, it would be useful to include something about this in the documentation (or maybe I missed it).

Good idea.

> There is a more general question of whether there is an efficient way to reassign a boolean indexed version of the same array back into the original zarr data store, without (necessarily) having to read the entire array into memory. That's a slightly different question (although I'm not sure if it's something peculiar that I want to do, but which wouldn't be of general use).

For that I would generally suggest using dask, e.g. something like:

store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store)
# create array to hold original data
foo = root.create_dataset('foo', shape=100, chunks=10, ...)
# store original data into foo somehow
# create a boolean array selecting items in foo
sel = ... # boolean array, same shape as foo
# create an array to hold the selected data
bar = root.create_dataset('bar', shape=np.count_nonzero(sel), chunks=10, ...)
# use dask to select and store
import dask.array as da
da.from_array(foo)[sel].rechunk(bar.chunks).store(bar, lock=False)

> However, going down this route, I can imagine it being a useful addition to be able to copy a zarr mask selection into a new zarr array via the = operator, and in this case it would be possible to use the object_codec and other parameters from the passed-in masked selection.

For all computations on zarr arrays, including copying data from one array to another, I would generally recommend using dask. We're trying to avoid putting any logic in zarr for things that could be done with dask, and that will generally be done better with dask given its ability to parallelize work.

Note that if you need to create a new array with some or all of the same parameters as an existing array, there are convenience functions, e.g., zarr.zeros_like(), also available as a method on Group objects.

Hth.

@jakirkham
Member

What about just reusing the same object_codec when creating the new Array?
