Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZEP0003 Variable chunks #18

Merged
merged 4 commits into from
Oct 21, 2022
Merged

Conversation

martindurant
Copy link
Member

Very preliminary version for discussion. Happy to start a github discussion or have a thread here. If responses are mostly favourable, I can make a PR to zarr-specs with the change and scope out how much work it would involve in zarr-python.

@martindurant
Copy link
Member Author

cc @rabernat , who has alternative ideas of how this functionality can be achieved.

@martindurant
Copy link
Member Author

Another draft of the same ideas, to be merged into here: https://hackmd.io/NaMo9YnBSFiZiO-ds1SFMA

cc @ivirshup

@MSanKeys963
Copy link
Member

Hi @martindurant.

Thanks for working on this. Here's my suggestion: We should complete this PR (ZEP0003) and merge it.

After merging, we can start the discussion in PR against this ZEP in the zarr-specs repo.

draft/ZEP0003.md Outdated Show resolved Hide resolved
## Detailed description

Currently, array metadata looks something like
```json
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for zarr v2 but does not match the current state of zarr v3. I think it would be helpful to revise this to apply to zarr v3, since I assume your intent here is to propose this as a zarr v3 extension.

The current zarr v3 spec seems to be designed to make the chunking scheme itself an extension point:

https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid

However, that extension point design has not really been discussed at all, nor have any alternative grid types been proposed, other than this one.

This proposal could simply be a new chunk_grid "type". But I am not sure if that is the best fit --- this proposal allows the non-uniform chunking to be specified for just some of the dimensions. Additionally, the v3 spec has "separator" as a field of the chunk grid type, but it equally applies to both regular and rectilinear grids.

It might be worth considering if there are any other grid types that would be useful, and if so, how they might interact with this proposal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, totally agree that, as far as the zarr-specs goes, a new subsection in that area is the way to go - with the regular grid being a special case. However, the spec does not go as far as to explicitly define how the chunks are specified in the metadata, it is only descriptive text. Is there a jsonschema somewhere I should propose a change to? When coming to write this ZEP, that's what I had assumed would be the case, but I cannot find one.

The separator is not material in this proposal - any separator can be used equally with regular or irregular grids.

if there are any other grid types that would be useful

Certainly worth thinking about. Sharding, for instance, would be thought of as hierarchical chunking. I don't know how we can fill it into this proposal, though. Ideas very welcome!

Copy link

@jbms jbms Oct 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no json schema. The zarr v3 spec does define the json metadata representation, though. For example this is where the existing regular chunk grid is specified:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid

The current spec organization is a bit confusing, though, in that in many cases there is 1 section that directly describes a given portion of the array metadata, and another separate section that provides more detailed information but does not specifically discuss the metadata representation. For example, we have the following two sections in the current spec related to the chunk grid:

Array metadata representation:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grid

More general information:
https://zarr-specs.readthedocs.io/en/latest/core/v3.0.html#chunk-grids

I think it would make sense to consolidate those sections together.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly worth thinking about. Sharding, for instance, would be thought of as hierarchical chunking. I don't know how we can fill it into this proposal, though. Ideas very welcome!

I think it could certainly make sense to use sharding in conjunction with a rectilinear grid --- in fact we could allow both the chunk grid and the shard grid to be rectilinear rather than regular.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jstriebel Any thoughts on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have referenced that link and stated that that's where the change will happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for answering so late. I think there's nothing that blocks using variable chunk sizes together with sharding in general, just the sharding extension would need to have support for this or a generalized abstract indexing schema. We also discussed briefly if sharding should allow to combinae a flexible number of chunks, but that might be added later if the need arises, and it seems unnecessary complexity for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if sharding should allow to combinae a flexible number of chunks, but that might be added later if the need arises

This need already arose and has been implemented in the special handling in preffs, an alternate implementation of referenceFS for kerchunked data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, what is the use-case for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is representing data that has this internal structure. Being kerchunk, you are stuck with whatever the original has, rather than being able to rechunk.

cc @d70-t

@jbms
Copy link

jbms commented Oct 18, 2022

Regarding the TODO item in the hackmd document:

What is stored for chunk offsets? cumulative?
   Upside of cumulative, random access

In practice an implementation will likely need the cumulative values (i.e. partition points) to do a binary search to map array coordinates to chunks. However, the JSON metadata will anyway need to be fully decoded; it isn't practical to operate directly on the raw JSON representation anyway. Therefore I don't think it matters for implementation efficiency whether the stored representation is cumulative or not, and instead we should just decide based on other considerations.

Points in favor of cumulative representation:

  • Better supports non-zero origin (which could be added as a zarr extension)
  • Makes partition points more obvious from looking at JSON representation directly

Points in favor of non-cumulative representation:

  • Avoids confusion for users of zarr implementations such as in Julia which use an origin of 1

@martindurant
Copy link
Member Author

Another point in favour of non-cumulative:
the storage size for the list of chunks will usually be much smaller by storing the deltas. This might be important in some cases.

Note that for dask, the chunks are usually specified by size (step/delta) but stored as cumulative internally.

@martindurant
Copy link
Member Author

Community decision please: shall we merge this to make it an official "draft" and start the formal ZEP discussions? I can click the ready-for-review button.

@martindurant martindurant marked this pull request as ready for review October 20, 2022 14:55
@rabernat
Copy link
Contributor

Sorry I have not had time to comment yet.

I am happy to see this move to draft status, where we can discuss it further.

@MSanKeys963
Copy link
Member

Hi @martindurant.

I've added a small change, and the rest of the ZEP looks good. Is there anything you'd like to add/change before I merge this?
Also, I was wondering if you'd like to add more info in the Usage and Impact section.
Thanks!

@martindurant
Copy link
Member Author

No, let's leave those, since in the discussion phase we may well decide on different ways of going about this anyway.

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
@martindurant
Copy link
Member Author

@ivirshup welcome to the zep, and thank you for writing your version! I simply didn't get around to updating admin fields.

@MSanKeys963 MSanKeys963 merged commit 634ab24 into zarr-developers:main Oct 21, 2022
@MSanKeys963
Copy link
Member

ZEP0003 published here: https://zarr.dev/zeps/draft/ZEP0003.html 🚀

@ivirshup
Copy link
Contributor

Discussion on this ZEP currently going on at: https://github.com/orgs/zarr-developers/discussions/52

@joshmoore joshmoore changed the title Variable chunks ZEP0003 Variable chunks May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants