
Negative chunk indexes, offset chunk origin #9

Open
chairmank opened this issue Jun 7, 2018 · 8 comments
Labels
protocol-extension: Protocol extension related issue

Comments

@chairmank

This is a feature request. If you think that it is worth doing, I volunteer to update the zarr specification and implementation.

The v2 specification states

For example, given an array with shape (10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides data for rows 0-1000 and columns 0-1000 and is stored under the key “0.0”; the chunk with indices (2, 4) provides data for rows 2000-3000 and columns 4000-5000 and is stored under the key “2.4”; etc.

Note that all chunks in an array have the same shape. If the length of any array dimension is not exactly divisible by the length of the corresponding chunk dimension then some chunks will overhang the edge of the array. The contents of any chunk region falling outside the array are undefined.
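For concreteness, here is a minimal sketch of the chunk grid the spec describes, assuming NumPy-style slicing; chunk_key and chunk_slices are illustrative helper names, not part of any zarr implementation:

def chunk_key(chunk_index):
    # Store key for a chunk, e.g. (2, 4) -> "2.4"
    return ".".join(str(i) for i in chunk_index)

def chunk_slices(chunk_index, chunk_shape):
    # Region of the array covered by a chunk; under the current rules it may
    # only overhang the far (high-index) edges of the array.
    return tuple(slice(i * c, (i + 1) * c)
                 for i, c in zip(chunk_index, chunk_shape))

# Chunk (2, 4) of a (10000, 10000) array with (1000, 1000) chunks:
assert chunk_key((2, 4)) == "2.4"
assert chunk_slices((2, 4), (1000, 1000)) == (slice(2000, 3000), slice(4000, 5000))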

Together, these statements imply the following restrictions:

  • the [0, 0, ..., 0] origin "corner" of an N-dimensional array must coincide with the origin corner of the [0, 0, ..., 0]th chunk
  • overhanging chunks may only appear at the edges of the array that are far from the origin

These restrictions make it convenient to grow/shrink an N-dimensional array along the edges that are far from zero but inconvenient to grow/shrink along the zero-index edges. For example, consider this chunked array with shape [3, 3]:

[[ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8]]

If this array is split into chunks of shape [2, 2], the chunks are (see the sketch after this list):

  • chunk [0, 0] is [[0, 1], [3, 4]]
  • chunk [0, 1] is [[2, undefined], [5, undefined]]
  • chunk [1, 0] is [[6, 7], [undefined, undefined]]
  • chunk [1, 1] is [[8, undefined], [undefined, undefined]]
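A small NumPy sketch of this chunking, using -1 as a stand-in for the undefined overhang (the get_chunk helper is illustrative only):

import numpy as np

a = np.arange(9).reshape(3, 3)     # the [3, 3] example array above

def get_chunk(arr, chunk_index, chunk_shape, fill=-1):
    # Extract one chunk, padding any overhang with a fill value.
    out = np.full(chunk_shape, fill, dtype=arr.dtype)
    sel = tuple(slice(i * c, min((i + 1) * c, n))
                for i, c, n in zip(chunk_index, chunk_shape, arr.shape))
    part = arr[sel]
    out[tuple(slice(0, s) for s in part.shape)] = part
    return out

print(get_chunk(a, (0, 1), (2, 2)))   # [[ 2 -1]  [ 5 -1]]
print(get_chunk(a, (1, 1), (2, 2)))   # [[ 8 -1]  [-1 -1]]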

To concatenate a row of zeroes [[0, 0, 0]] along the far edge of the 0th dimension (away from index 0), I only need to change the shape of the array to [4, 3] and update chunks [1, 0] and [1, 1]:

[[ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8],
 [ 0, 0, 0]]
  • chunk [0, 0] is unchanged
  • chunk [0, 1] is unchanged
  • chunk [1, 0] becomes [[6, 7], [0, 0]]
  • chunk [1, 1] becomes [[8, undefined], [0, undefined]]

However, to concatenate along the opposite edge, I need to shift the chunk origin and cannot reuse any of the existing chunks:

[[ 0, 0, 0],
 [ 0, 1, 2],
 [ 3, 4, 5],
 [ 6, 7, 8]]
  • chunk [0, 0] becomes [[0, 0], [0, 1]]
  • chunk [0, 1] becomes [[0, undefined], [2, undefined]]
  • chunk [1, 0] becomes [[3, 4], [6, 7]]
  • chunk [1, 1] becomes [[5, undefined], [8, undefined]]

This rechunking is expensive for big arrays that are repeatedly grown in the "negative" direction.
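A quick way to see why: under the current rules, element i along a dimension lives in chunk i // c at offset i % c (c being the chunk length). Prepending one row shifts every element from index i to i + 1, so its chunk index and/or its offset within the chunk changes, and every chunk must be rewritten:

c = 2                              # chunk length along the 0th dimension
for i in range(3):                 # row indices of the original [3, 3] array
    print(i, (i // c, i % c), ((i + 1) // c, (i + 1) % c))
# 0 (0, 0) (0, 1)   row 0 stays in chunk row 0 but moves within it
# 1 (0, 1) (1, 0)   row 1 moves from chunk row 0 to chunk row 1
# 2 (1, 0) (1, 1)   row 2 stays in chunk row 1 but moves within it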

I propose relaxing these restrictions to make this kind of prepend easier. Specifically, I propose the following (a sketch of the resulting metadata follows the list):

  • Overhanging chunks may appear along any edge of the array
  • The array metadata has a new key, chunk_origin (feel free to invent a better name!), which specifies the location in the N-dimensional array of the origin "corner" of the [0, 0, ..., 0]th chunk. If unspecified, the default value of chunk_origin is [0, 0, ..., 0] (meaning that the chunk origin coincides with the origin of the array itself), to preserve backwards compatibility.
  • Chunk indices may be negative
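For illustration, here is what the array metadata could look like with the proposed key; chunk_origin is hypothetical and does not exist in the v2 spec, while the other keys are the standard v2 .zarray fields:

metadata = {
    "zarr_format": 2,
    "shape": [4, 3],
    "chunks": [2, 2],
    "dtype": "<i8",
    "compressor": None,
    "fill_value": None,
    "order": "C",
    "filters": None,
    # Proposed (hypothetical) key: array position of the origin corner of
    # chunk [0, 0]. Omitting it would default to [0, 0] for compatibility.
    "chunk_origin": [1, 0],
}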

In the example above, we can efficiently prepend along the zero edge by changing the shape to [4, 3], changing the chunk origin from [0, 0] to [1, 0], and adding new chunks with negative indexes (a sketch of the index arithmetic follows this list):

  • add chunk [-1, 0] with contents [[undefined, undefined], [0, 0]]
  • add chunk [-1, 1] with contents [[undefined, undefined], [0, undefined]]
  • chunk [0, 0] is unchanged
  • chunk [0, 1] is unchanged
  • chunk [1, 0] is unchanged
  • chunk [1, 1] is unchanged
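A minimal sketch of the index arithmetic under the proposed chunk_origin (the chunk_index helper is hypothetical; Python's floor division conveniently yields the negative chunk indexes):

def chunk_index(array_index, chunk_shape, chunk_origin):
    # Chunk holding a given array element; chunk_origin is the array position
    # of the origin corner of chunk [0, 0], so indexes may be negative.
    return tuple((i - o) // c
                 for i, o, c in zip(array_index, chunk_origin, chunk_shape))

origin = (1, 0)                                  # after prepending one row
print(chunk_index((0, 0), (2, 2), origin))       # (-1, 0): the new row
print(chunk_index((1, 0), (2, 2), origin))       # (0, 0): old chunk, reused
print(chunk_index((3, 2), (2, 2), origin))       # (1, 1): old chunk, reused

Negative indexes would presumably encode into keys such as "-1.0" under the existing dot-separated scheme.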

What do you think?

@chairmank
Author

This issue may interact with zarr-developers/zarr-python#233

@alimanfoo
Member

Thanks @chairmank for this idea, and apologies for the delay in responding. This is a neat idea; it is appealing to have symmetry so that it is as easy to grow an array in the "negative" direction as in the "positive" direction. Or, put another way, so that it is as easy to prepend data as to append it.

To make this work in general, where you might want to grow an array in the negative direction by an amount that is not an exact multiple of the chunk size, I think another metadata item would be needed in addition to chunk_origin, something like array_origin, which records the offset of the beginning of the array within the chunks on the leading edge. This would allow full symmetry, in the sense that chunks could then overhang the leading edge.
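To make the leading-edge overhang concrete, here is a sketch under one possible reading (the names, and the idea of folding both offsets into a single per-dimension grid offset, are my assumptions, not something settled in this thread): if the corner of chunk [0, ..., 0] can sit before the start of the array, the valid region of any chunk is obtained by clipping at both edges:

def chunk_region(chunk_idx, chunk_shape, grid_offset, array_shape):
    # Array region covered by a chunk, clipped at both edges. grid_offset is
    # the array position of chunk [0, ..., 0]'s corner; a negative or
    # sub-chunk component of it plays the role of the suggested array_origin.
    region = []
    for k, c, o, n in zip(chunk_idx, chunk_shape, grid_offset, array_shape):
        start = k * c + o
        region.append(slice(max(start, 0), min(start + c, n)))
    return tuple(region)

# The prepend example above: shape [4, 3], chunks [2, 2], grid offset [1, 0].
print(chunk_region((-1, 0), (2, 2), (1, 0), (4, 3)))   # (slice(0, 1), slice(0, 2))
print(chunk_region((1, 1), (2, 2), (1, 0), (4, 3)))    # (slice(3, 4), slice(2, 3))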

Because this would involve a spec change, we need to consider it carefully. If you could provide a bit of information about the use case that you have for this feature, that would be very helpful. I would also like to wait until we have use cases from at least one or two others. There are also at least two other implementations of the Zarr spec currently in progress, so I'd like to keep things stable for a while, and would want to consult with them to see whether they agree this feature is worth implementing on their side before going ahead. So maybe we can mull this over for a bit.

As an aside, it would be nice to find a way of collecting together and advertising all proposed changes to the spec, so that folks could easily see what major changes are being considered, and also have an easy way to vote on them so we can canvass opinion. Something like the enhancement proposals used in other projects (PEP/NEP)? Maybe the spec should even be broken out into a separate GitHub repo to make this easier? @jakirkham, any thoughts?

@tam203

tam203 commented Mar 15, 2019

@chairmank @alimanfoo I'm really keen on this change. Here at the Met Office we have a lot of big but also fast-moving datasets that would benefit from this.

I've implemented this spec change (sort of) and am convinced it's got legs. I've written a blog post, https://medium.com/informatics-lab/creating-a-data-format-for-high-momentum-datasets-a394fa48b671, about why it's important and about a workaround we've implemented.

There is also a Binder if you want to jump straight in.

Your thoughts appreciated.

@alimanfoo
Member

@tam203 that is a beautiful write-up! Knowing this is an important feature for your use cases, I'm more than happy to consider it in scope for the next spec version. We've been having some initial discussions around the next spec version via the zarr/n5 conference calls (#315); things are just getting going, but we are gradually building up some momentum to tackle a new spec version. Calls are open, so you'd be welcome to join if interested. There is also a new repo dedicated to spec development - https://github.com/zarr-developers/zarr-specs - I think I will migrate this issue over there.

@alimanfoo transferred this issue from zarr-developers/zarr-python Mar 15, 2019
@alimanfoo changed the title from "negative chunk indexes, offset chunk origin" to "Negative chunk indexes, offset chunk origin" Mar 15, 2019
@meggart
Member

meggart commented Mar 18, 2019

I just want to give this proposal a big +1. Indeed, for my use case I am already using a user-space attribute specifying the array offset, e.g. here: https://github.com/esa-esdl/ESDL.jl/blob/zarrOnly/src/Cubes/ZarrCubes.jl#L70, because I needed chunkings that don't start in the [0, 0] corner. So that would be at least one additional real-life use case.

@rsignell-usgs

This subject came up again on this week's Pangeo call, as folks at the US National Water Center are interested in testing the publication of National Water Model forecasts using Zarr. Just letting folks know there is still keen interest!

@jstriebel added the protocol-extension (Protocol extension related issue) label and removed the core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec) label Nov 16, 2022
@bluppfisk

A big +1 from us, with a very similar problem to the one the Met Office is tackling (are they still, though? The latest post is from 2019, and the "watch this space" notice didn't get a follow-up).
Looking at the proposed workaround and the proposed solutions, I see one issue that may or may not exist only in my head:

Chunk index numbers could potentially grow huge. Someone commenting on the original blog post suggested setting the origin to (shape + origin - 1) % shape to alleviate this problem. I'm not well enough versed in maths to spot whether this may ever cause trouble.

A simpler workaround (not requiring code changes) might be to use one zarr group per day (or any other time unit) and simply delete these groups when needed. This may not fit everyone's use case, maybe not even ours; we're still working on it.

@jbms
Contributor

jbms commented Jan 19, 2023

There has been extensive discussion of this issue, both in #122 and in the Zarr community meeting. This is not planned to be part of the initial zarr v3 spec, but it would be great to standardize it as an extension soon.
