Negative chunk indexes, offset chunk origin #9
Comments
This issue may interact with zarr-developers/zarr-python#233.
Thanks @chairmank for this idea, and apologies for the delay in responding. This is a neat idea: it is appealing to have symmetry, so it is as easy to grow an array in both "positive" and "negative" directions. Or, put another way, so it is as easy to append as to prepend data. To make this work in general, where you might want to grow an array in the negative direction by an amount that is not an exact multiple of the chunk size, I think another metadata item would be needed in addition.

Because this would involve a spec change, we need to consider this carefully. If you could provide a bit of information about the use case that you have for this feature, that would be very helpful. I would also like to wait to have use cases from at least one or two others. Also, there are at least two other implementations of the Zarr spec currently in progress, so I'd like to keep things stable for a while, and would want to consult with them to see if they agreed this feature was worth implementing on their side before going ahead. So maybe we can mull this over for a bit.

As an aside, it would be nice to find a way of collecting together and advertising all proposed changes to the spec, so that folks could easily see what major changes are being considered, and also have an easy way to vote on them so we can canvass opinion. Something like the enhancement proposals used in other projects (PEP/NEP)? Maybe the spec should even be broken out into a separate GitHub repo to make this easier? @jakirkham any thoughts?
@chairmank @alimanfoo I'm really keen on this change. Here at the Met Office we have a lot of big but also fast-moving datasets that would benefit from this. I've implemented this spec change (sort of) and am convinced it's got legs. I've written a blog post about why it's important and about a workaround we've implemented: https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671. There is also a Binder if you want to jump straight in. Your thoughts appreciated.
@tam203 that is a beautiful write-up! Knowing this is an important feature for your use cases, I'm more than happy to consider this in scope for the next spec version.

We've been having some initial discussions around the next spec version via the Zarr/N5 conference calls (#315); things are just getting going, but we are gradually building up some momentum to tackle a new spec version. Calls are open, so you'd be welcome to join if interested. There is also a new repo dedicated to spec development - https://github.com/zarr-developers/zarr-specs - I think I will migrate this issue over there.
I just want to give this proposal a big +1. Indeed, for my use case I am already using a user-space attribute specifying the array offset, e.g. here: https://github.com/esa-esdl/ESDL.jl/blob/zarrOnly/src/Cubes/ZarrCubes.jl#L70, because I needed chunkings that don't start in the `[0, 0]` corner. So that would be at least one additional real-life use case.
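That user-space workaround can be sketched in pure Python (a hypothetical illustration only: the attribute name `offset` and the `stored_coord` helper are assumptions, not part of the Zarr spec or the linked ESDL.jl code):

```python
# Hypothetical sketch of the user-space workaround: record the array's
# offset as an ordinary user attribute (e.g. in .zattrs) and apply it
# whenever translating user-facing coordinates to stored coordinates.
# The attribute name "offset" is an assumption, not part of the Zarr spec.
attrs = {"offset": [5, 0]}  # the chunk grid starts at [5, 0], not [0, 0]

def stored_coord(user_coord, attrs):
    """Translate a user-facing coordinate to the stored coordinate."""
    return tuple(u + o for u, o in zip(user_coord, attrs["offset"]))

stored_coord((0, 0), attrs)  # -> (5, 0)
```

Every reader has to know about and honour this attribute, which is exactly why moving the offset into the spec itself would help.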
This subject came up again on this week's Pangeo call, as folks at the US National Water Center are interested in testing Zarr for putting out National Water Model forecasts. Just letting folks know there is still keen interest!
A big +1 from us; we have a very similar problem to the one the Met Office is tackling (are they still, though? The latest post is from 2019, and the "watch this space" notice didn't get a follow-up). One concern is chunk index numbers potentially growing huge; someone commenting on the original blog post suggested offsetting the origin for that reason.

A simpler workaround (not requiring code changes) might be using Zarr groups per day (or any other time unit), and just deleting these groups when needed. This may not fit everyone's use case - maybe not even ours; we're still working on it.
There has been extensive discussion of this issue, both in #122 and in the Zarr community meeting. This is not planned to be part of the initial Zarr v3 spec, but it would be great to standardize it as an extension soon.
This is a feature request. If you think that it is worth doing, I volunteer to update the zarr specification and implementation.
The v2 specification states:

Together, these statements imply the following restrictions:

- chunk indexes are non-negative
- the `[0, 0, ..., 0]` origin "corner" of an N-dimensional array must coincide with the origin corner of the `[0, 0, ..., 0]`th chunk

These restrictions make it convenient to grow/shrink an N-dimensional array along the edges that are far from zero, but inconvenient to grow/shrink along the zero-index edges. For example, consider this chunked array with shape `[3, 3]` (the values `0` through `8` in row-major order). If this array is split into chunks of shape `[2, 2]`, the chunks are:

- chunk `[0, 0]` is `[[0, 1], [3, 4]]`
- chunk `[0, 1]` is `[[2, undefined], [5, undefined]]`
- chunk `[1, 0]` is `[[6, 7], [undefined, undefined]]`
- chunk `[1, 1]` is `[[8, undefined], [undefined, undefined]]`
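The chunk layout above can be reproduced with a small pure-Python sketch (an illustration only, not the Zarr API; `undefined` positions are represented as `None`):

```python
# A pure-Python illustration of the chunk layout described above (not the
# Zarr API). Positions beyond the array edge are "undefined", shown as None.
def to_chunks(arr, chunk_rows, chunk_cols):
    """Split a 2-D list into chunks, padding past-the-edge cells with None."""
    rows, cols = len(arr), len(arr[0])
    chunks = {}
    for ci in range(-(-rows // chunk_rows)):      # ceil(rows / chunk_rows)
        for cj in range(-(-cols // chunk_cols)):  # ceil(cols / chunk_cols)
            block = [[None] * chunk_cols for _ in range(chunk_rows)]
            for r in range(chunk_rows):
                for c in range(chunk_cols):
                    ar, ac = ci * chunk_rows + r, cj * chunk_cols + c
                    if ar < rows and ac < cols:
                        block[r][c] = arr[ar][ac]
            chunks[(ci, cj)] = block
    return chunks

arr = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # the shape [3, 3] array above
chunks = to_chunks(arr, 2, 2)
# chunks[(0, 0)] == [[0, 1], [3, 4]]
# chunks[(1, 1)] == [[8, None], [None, None]]
```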
To concatenate an array of zeroes `[[0, 0, 0]]` on the non-zero edge of the 0th dimension, I only need to change the shape of the array to `[4, 3]` and update chunks `[1, 0]` and `[1, 1]`:

- chunk `[0, 0]` is unchanged
- chunk `[0, 1]` is unchanged
- chunk `[1, 0]` becomes `[[6, 7], [0, 0]]`
- chunk `[1, 1]` becomes `[[8, undefined], [0, undefined]]`
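A quick sketch (again illustrative pure Python, not the Zarr API) confirms that appending on the non-zero edge leaves chunks `[0, 0]` and `[0, 1]` untouched:

```python
def to_chunks(arr, cr, cc):
    """Split a 2-D list into cr-by-cc chunks, padding with None."""
    rows, cols = len(arr), len(arr[0])
    return {
        (ci, cj): [[arr[ci * cr + r][cj * cc + c]
                    if ci * cr + r < rows and cj * cc + c < cols else None
                    for c in range(cc)] for r in range(cr)]
        for ci in range(-(-rows // cr))
        for cj in range(-(-cols // cc))
    }

before = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)
after = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 0, 0]], 2, 2)

unchanged = sorted(k for k in before if before[k] == after[k])
# unchanged == [(0, 0), (0, 1)]: only the edge chunks need rewriting
```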
However, to concatenate on the opposite edge, I need to shift the chunk origin and cannot reuse any of the existing chunks:

- chunk `[0, 0]` becomes `[[0, 0], [0, 1]]`
- chunk `[0, 1]` becomes `[[0, undefined], [2, undefined]]`
- chunk `[1, 0]` becomes `[[3, 4], [6, 7]]`
- chunk `[1, 1]` becomes `[[5, undefined], [8, undefined]]`

This rechunking is expensive for big arrays that are repeatedly grown in the "negative" direction.
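The cost of growing in the negative direction can be demonstrated the same way (illustrative pure Python, not the Zarr API): after prepending a row, no existing chunk survives unchanged.

```python
def to_chunks(arr, cr, cc):
    """Split a 2-D list into cr-by-cc chunks, padding with None."""
    rows, cols = len(arr), len(arr[0])
    return {
        (ci, cj): [[arr[ci * cr + r][cj * cc + c]
                    if ci * cr + r < rows and cj * cc + c < cols else None
                    for c in range(cc)] for r in range(cr)]
        for ci in range(-(-rows // cr))
        for cj in range(-(-cols // cc))
    }

before = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)
prepended = to_chunks([[0, 0, 0], [0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)

unchanged = [k for k in before if before[k] == prepended[k]]
# unchanged == []: every existing chunk must be rewritten
```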
I propose relaxing these restrictions to make this append easier. Specifically, I propose the following:

- allow chunk indexes to be negative
- add a new metadata item, `chunk_origin` (feel free to invent a better name!), which specifies the location in the N-dimensional array of the origin "corner" of the `[0, 0, ..., 0]`th chunk. If unspecified, the default value of `chunk_origin` is `[0, 0, ..., 0]` (meaning that the chunk origin coincides with the origin of the array itself), to preserve backwards compatibility.

In the example above, we can efficiently append along the zero edge by changing the shape to `[4, 3]`, changing the chunk origin from `[0, 0]` to `[1, 0]`, and adding new chunks with negative indexes:

- chunk `[-1, 0]` with contents `[[undefined, undefined], [0, 0]]`
- chunk `[-1, 1]` with contents `[[undefined, undefined], [0, undefined]]`
- chunk `[0, 0]` is unchanged
- chunk `[0, 1]` is unchanged
- chunk `[1, 0]` is unchanged
- chunk `[1, 1]` is unchanged

What do you think?
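To make the proposal concrete, here is a sketch of how a coordinate could be mapped to a chunk under these semantics (my own assumptions about how `chunk_origin` would be interpreted; the `locate` helper is hypothetical, not part of any implementation):

```python
# Hypothetical sketch of the proposed chunk_origin semantics: chunk_origin
# gives the array coordinates of the origin corner of chunk [0, 0, ..., 0],
# and chunk indexes may be negative. Python's floor division and modulo
# conveniently produce negative chunk indexes with in-chunk offsets in range.
def locate(coord, chunk_shape, chunk_origin):
    """Map an array coordinate to (chunk index, offset within chunk)."""
    idx = tuple((x - o) // c for x, c, o in zip(coord, chunk_shape, chunk_origin))
    off = tuple((x - o) % c for x, c, o in zip(coord, chunk_shape, chunk_origin))
    return idx, off

# Shape [4, 3], chunks [2, 2], chunk_origin [1, 0] after prepending one row:
locate((0, 0), (2, 2), (1, 0))  # -> ((-1, 0), (1, 0)): falls in a new negative chunk
locate((1, 0), (2, 2), (1, 0))  # -> ((0, 0), (0, 0)): existing chunks are reused
```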