Negative chunk indexes, offset chunk origin #9
Comments
This issue may interact with zarr-developers/zarr-python#233.
Thanks @chairmank for this idea, and apologies for the delay in responding. This is a neat idea: it is appealing to have symmetry, so it is as easy to grow an array in both "positive" and "negative" directions. Or, put another way, so it is as easy to append as to prepend data. To make this work in general, where you might want to grow an array in the negative direction by an amount that is not an exact multiple of the chunk size, I think another metadata item would be needed in addition.

Because this would involve a spec change, we need to consider this carefully. If you could provide a bit of information about the use case that you have for this feature, that would be very helpful. I would also like to wait to have use cases from at least one or two others. Also, there are at least two other implementations of the Zarr spec currently in progress, so I'd like to keep things stable for a while, and would want to consult with them to see if they agreed this feature was worth implementing on their side before going ahead. So maybe we can mull this over for a bit.

As an aside, it would be nice to find a way of collecting together and advertising all proposed changes to the spec, so that folks could easily see what major changes are being considered, and also have an easy way to vote on them so we can canvass opinion. Something like the enhancement proposals used in other projects (PEP/NEP)? Maybe the spec should even be broken out into a separate GitHub repo to make this easier? @jakirkham any thoughts?
@chairmank @alimanfoo I'm really keen on this change. Here at the Met Office we have a lot of big but also fast-moving datasets that would benefit from this. I've implemented this spec change (sort of) and am convinced it's got legs. I've written a blog post about why it's important and about a workaround we've implemented: https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671. There is also a Binder if you want to jump straight in. Your thoughts appreciated.
@tam203 that is a beautiful write-up! Knowing this is an important feature for your use cases, I'm more than happy to consider this in scope for the next spec version.

We've been having some initial discussions around the next spec version via the Zarr/N5 conference calls (#315); things are just getting going, but we are gradually building up some momentum to tackle a new spec version. Calls are open, so you'd be welcome to join if interested. There is also a new repo dedicated to spec development - https://github.com/zarr-developers/zarr-specs - I think I will migrate this issue over there.
I just want to give this proposal a big +1. Indeed, for my use case I am already using a user-space attribute specifying the array offset, e.g. here: https://github.com/esa-esdl/ESDL.jl/blob/zarrOnly/src/Cubes/ZarrCubes.jl#L70, because I needed chunkings that don't start in the `[0, 0]` corner. So that would be at least one additional real-life use case.
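That user-space workaround can be sketched in pure Python (a hypothetical illustration only: the attribute name `offset` and the `stored_coord` helper are assumptions, not part of the Zarr spec or the linked ESDL.jl code):

```python
# Hypothetical sketch of the user-space workaround: record the array's
# offset as an ordinary user attribute (e.g. in .zattrs) and apply it
# whenever translating user-facing coordinates to stored coordinates.
# The attribute name "offset" is an assumption, not part of the Zarr spec.
attrs = {"offset": [5, 0]}  # the chunk grid starts at [5, 0], not [0, 0]

def stored_coord(user_coord, attrs):
    """Translate a user-facing coordinate to the stored coordinate."""
    return tuple(u + o for u, o in zip(user_coord, attrs["offset"]))

stored_coord((0, 0), attrs)  # -> (5, 0)
```

Every reader has to know about and honour this attribute, which is exactly why moving the offset into the spec itself would help.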
This subject came up again on this week's Pangeo call, as folks at the US National Water Center are interested in testing Zarr for putting out National Water Model forecasts. Just letting folks know there is still keen interest!
A big +1 from us; we have a very similar problem to the one the Met Office is tackling (are they still, though? The latest post is from 2019, and the "watch this space" notice didn't get a follow-up). One concern is chunk index numbers potentially growing huge; someone commenting on the original blog post suggested offsetting the origin for that reason.

A simpler workaround (not requiring code changes) might be using Zarr groups per day (or any other time unit), and just deleting these groups when needed. This may not fit everyone's use case - maybe not even ours; we're still working on it.
There has been extensive discussion of this issue, both in #122 and in the Zarr community meeting. This is not planned to be part of the initial Zarr v3 spec, but it would be great to standardize it as an extension soon.
This is a feature request. If you think that it is worth doing, I volunteer to update the zarr specification and implementation.
The v2 specification states:

Together, these statements imply the following restrictions:

- chunk indexes are non-negative
- the `[0, 0, ..., 0]` origin "corner" of an N-dimensional array must coincide with the origin corner of the `[0, 0, ..., 0]`th chunk

These restrictions make it convenient to grow/shrink an N-dimensional array along the edges that are far from zero, but inconvenient to grow/shrink along the zero-index edges. For example, consider this chunked array with shape `[3, 3]` (the values `0` through `8` in row-major order). If this array is split into chunks of shape `[2, 2]`, the chunks are:

- chunk `[0, 0]` is `[[0, 1], [3, 4]]`
- chunk `[0, 1]` is `[[2, undefined], [5, undefined]]`
- chunk `[1, 0]` is `[[6, 7], [undefined, undefined]]`
- chunk `[1, 1]` is `[[8, undefined], [undefined, undefined]]`
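The chunk layout above can be reproduced with a small pure-Python sketch (an illustration only, not the Zarr API; `undefined` positions are represented as `None`):

```python
# A pure-Python illustration of the chunk layout described above (not the
# Zarr API). Positions beyond the array edge are "undefined", shown as None.
def to_chunks(arr, chunk_rows, chunk_cols):
    """Split a 2-D list into chunks, padding past-the-edge cells with None."""
    rows, cols = len(arr), len(arr[0])
    chunks = {}
    for ci in range(-(-rows // chunk_rows)):      # ceil(rows / chunk_rows)
        for cj in range(-(-cols // chunk_cols)):  # ceil(cols / chunk_cols)
            block = [[None] * chunk_cols for _ in range(chunk_rows)]
            for r in range(chunk_rows):
                for c in range(chunk_cols):
                    ar, ac = ci * chunk_rows + r, cj * chunk_cols + c
                    if ar < rows and ac < cols:
                        block[r][c] = arr[ar][ac]
            chunks[(ci, cj)] = block
    return chunks

arr = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # the shape [3, 3] array above
chunks = to_chunks(arr, 2, 2)
# chunks[(0, 0)] == [[0, 1], [3, 4]]
# chunks[(1, 1)] == [[8, None], [None, None]]
```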
To concatenate an array of zeroes `[[0, 0, 0]]` on the non-zero edge of the 0th dimension, I only need to change the shape of the array to `[4, 3]` and update chunks `[1, 0]` and `[1, 1]`:

- chunk `[0, 0]` is unchanged
- chunk `[0, 1]` is unchanged
- chunk `[1, 0]` becomes `[[6, 7], [0, 0]]`
- chunk `[1, 1]` becomes `[[8, undefined], [0, undefined]]`
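A quick sketch (again illustrative pure Python, not the Zarr API) confirms that appending on the non-zero edge leaves chunks `[0, 0]` and `[0, 1]` untouched:

```python
def to_chunks(arr, cr, cc):
    """Split a 2-D list into cr-by-cc chunks, padding with None."""
    rows, cols = len(arr), len(arr[0])
    return {
        (ci, cj): [[arr[ci * cr + r][cj * cc + c]
                    if ci * cr + r < rows and cj * cc + c < cols else None
                    for c in range(cc)] for r in range(cr)]
        for ci in range(-(-rows // cr))
        for cj in range(-(-cols // cc))
    }

before = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)
after = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 0, 0]], 2, 2)

unchanged = sorted(k for k in before if before[k] == after[k])
# unchanged == [(0, 0), (0, 1)]: only the edge chunks need rewriting
```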
However, to concatenate on the opposite edge, I need to shift the chunk origin and cannot reuse any of the existing chunks:

- chunk `[0, 0]` becomes `[[0, 0], [0, 1]]`
- chunk `[0, 1]` becomes `[[0, undefined], [2, undefined]]`
- chunk `[1, 0]` becomes `[[3, 4], [6, 7]]`
- chunk `[1, 1]` becomes `[[5, undefined], [8, undefined]]`

This rechunking is expensive for big arrays that are repeatedly grown in the "negative" direction.
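The cost of growing in the negative direction can be demonstrated the same way (illustrative pure Python, not the Zarr API): after prepending a row, no existing chunk survives unchanged.

```python
def to_chunks(arr, cr, cc):
    """Split a 2-D list into cr-by-cc chunks, padding with None."""
    rows, cols = len(arr), len(arr[0])
    return {
        (ci, cj): [[arr[ci * cr + r][cj * cc + c]
                    if ci * cr + r < rows and cj * cc + c < cols else None
                    for c in range(cc)] for r in range(cr)]
        for ci in range(-(-rows // cr))
        for cj in range(-(-cols // cc))
    }

before = to_chunks([[0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)
prepended = to_chunks([[0, 0, 0], [0, 1, 2], [3, 4, 5], [6, 7, 8]], 2, 2)

unchanged = [k for k in before if before[k] == prepended[k]]
# unchanged == []: every existing chunk must be rewritten
```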
I propose relaxing these restrictions to make this append easier. Specifically, I propose the following:

- allow chunk indexes to be negative
- add a new metadata item, `chunk_origin` (feel free to invent a better name!), which specifies the location in the N-dimensional array of the origin "corner" of the `[0, 0, ..., 0]`th chunk. If unspecified, the default value of `chunk_origin` is `[0, 0, ..., 0]` (meaning that the chunk origin coincides with the origin of the array itself), to preserve backwards compatibility.

In the example above, we can efficiently append along the zero edge by changing the shape to `[4, 3]`, changing the chunk origin from `[0, 0]` to `[1, 0]`, and adding new chunks with negative indexes:

- chunk `[-1, 0]` with contents `[[undefined, undefined], [0, 0]]`
- chunk `[-1, 1]` with contents `[[undefined, undefined], [0, undefined]]`
- chunk `[0, 0]` is unchanged
- chunk `[0, 1]` is unchanged
- chunk `[1, 0]` is unchanged
- chunk `[1, 1]` is unchanged

What do you think?
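To make the proposal concrete, here is a sketch of how a coordinate could be mapped to a chunk under these semantics (my own assumptions about how `chunk_origin` would be interpreted; the `locate` helper is hypothetical, not part of any implementation):

```python
# Hypothetical sketch of the proposed chunk_origin semantics: chunk_origin
# gives the array coordinates of the origin corner of chunk [0, 0, ..., 0],
# and chunk indexes may be negative. Python's floor division and modulo
# conveniently produce negative chunk indexes with in-chunk offsets in range.
def locate(coord, chunk_shape, chunk_origin):
    """Map an array coordinate to (chunk index, offset within chunk)."""
    idx = tuple((x - o) // c for x, c, o in zip(coord, chunk_shape, chunk_origin))
    off = tuple((x - o) % c for x, c, o in zip(coord, chunk_shape, chunk_origin))
    return idx, off

# Shape [4, 3], chunks [2, 2], chunk_origin [1, 0] after prepending one row:
locate((0, 0), (2, 2), (1, 0))  # -> ((-1, 0), (1, 0)): falls in a new negative chunk
locate((1, 0), (2, 2), (1, 0))  # -> ((0, 0), (0, 0)): existing chunks are reused
```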