v2: Standardizing .zmetadata #113

Open
DennisHeimbigner opened this issue May 8, 2021 · 11 comments
@DennisHeimbigner

I want to begin a discussion about standardizing
the .zmetadata format for consolidated metadata.

Suppose we have this Zarr container.

.zgroup -- of the root group
var1
    .zarray -- for var1
subgroup1
    .zgroup
    var2
        .zarray -- for var2
        .zattrs  -- for var2    

This structure needs to be encoded as JSON in the .zmetadata object.
I can see two obvious encodings:

  1. nested encoding
{
".zgroup": {<contents of the .zgroup>},
"var1": {
    ".zarray": {<contents of .zarray>}
    },
"subgroup1": {
    ".zgroup": {<contents of the .zgroup>},
    "var2": {
        ".zarray": {<contents of .zarray>},
        ".zattrs": {<contents of .zattrs>}
        }
    }
}
  2. flat-key encoding
{
"/.zgroup": {<contents of the .zgroup>},
"/var1/.zarray": {<contents of .zarray>},
"/subgroup1/.zgroup": {<contents of the .zgroup>},
"/subgroup1/var2/.zarray": {<contents of .zarray>},
"/subgroup1/var2/.zattrs": {<contents of .zattrs>}
}

My observations:

  • The flat-key encoding should, as a rule, be slightly smaller than the
    nested encoding.
  • The nested encoding would be easier to process into internal data structures,
    but that would depend on the implementation. It would be faster for netcdf-c,
    but might not be for zarr-python.
  • Note that I have prefixed each key with "/", but that is just my choice; a decision is needed about that.
  • The one example I have seen in the wild uses flat-key encoding.
  • The flat-key encoding has no entries for non-content-bearing objects. So, for example, there is neither a "/subgroup1" key nor a "/subgroup1/var2" key. This seems reasonable, since such keys would not add any useful information.
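
To make the relationship between the two encodings concrete, here is a minimal Python sketch that converts the nested form into the flat-key form. The `flatten` helper and the placeholder contents (`zarr_format`, `shape`, `units` values) are assumptions for illustration only; they are not part of any existing library or of the proposal itself. The "/" prefix follows the choice made in the example above.

```python
# Names of the Zarr v2 metadata objects that carry content.
METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def flatten(nested, prefix=""):
    """Hypothetical helper: turn the nested encoding into flat keys."""
    flat = {}
    for name, value in nested.items():
        path = f"{prefix}/{name}"
        if name in METADATA_NAMES:
            # A metadata object: emit it under its full path.
            flat[path] = value
        else:
            # A group or array node: recurse, but emit no key for the node itself.
            flat.update(flatten(value, path))
    return flat


# Placeholder contents; real .zgroup/.zarray/.zattrs objects hold more fields.
nested = {
    ".zgroup": {"zarr_format": 2},
    "var1": {".zarray": {"shape": [10]}},
    "subgroup1": {
        ".zgroup": {"zarr_format": 2},
        "var2": {".zarray": {"shape": [5]}, ".zattrs": {"units": "m"}},
    },
}

flat = flatten(nested)
for key in sorted(flat):
    print(key)
```

Note that the result has no "/subgroup1" or "/subgroup1/var2" key of its own, matching the last observation in the list.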
@shoyer

shoyer commented May 9, 2021

The current .zmetadata format written by Zarr-Python uses the "flat-key encoding" without a prefix, e.g., looking at one of my recent datasets:

{
    "metadata": {
        ".zattrs": ...
        ".zgroup": ...
        "lat/.zarray": ...
        "lat/.zattrs": ...
        "level/.zarray": ...
        "level/.zattrs": ...
        ...
        "z/.zarray": ...
        "z/.zattrs": ...
    },
    "zarr_consolidated_format": 1
}

I suspect this format was chosen in part because it's slightly more natural in Zarr-Python to look up flat-keys rather than nested keys. But I'm sure performance would be fine with nested keys, too. Inherently both seem fine to me.
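
As a rough illustration of why flat keys are convenient to look up, the sketch below parses a document shaped like the example above using only the standard `json` module. The `lookup` helper and the values inside each metadata object are placeholders assumed for the example; this is not how Zarr-Python itself is implemented.

```python
import json

# A tiny document in the shape shown above; the inner values are placeholders.
doc = json.loads("""
{
    "metadata": {
        ".zattrs": {},
        ".zgroup": {"zarr_format": 2},
        "lat/.zarray": {"shape": [721]},
        "lat/.zattrs": {"units": "degrees_north"}
    },
    "zarr_consolidated_format": 1
}
""")


def lookup(doc, key):
    """Hypothetical helper: fetch one metadata object by its flat key."""
    if doc.get("zarr_consolidated_format") != 1:
        raise ValueError("unsupported consolidated metadata format")
    # With flat keys, a store-style path maps directly to one dictionary entry.
    return doc["metadata"][key]


print(lookup(doc, "lat/.zarray"))
```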

I trust that performance for parsing either structure in netCDF-C would probably be OK, too? At least for "reasonable" size consolidated metadata? But even if performance would be similar, if it's significantly harder to work with non-nested metadata on some platforms that is definitely worth considering.

@DennisHeimbigner
Author

The reason that a nested encoding is slightly preferable for netcdf is that we track groups as independent objects, so we have to parse the flat keys to get the group info out of them.
BTW, should there be a requirement that the order of flat keys be sorted?
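
The key parsing mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the netcdf-c implementation; the `unflatten` helper and the sample contents are assumed for the example. It accepts keys with or without a leading "/".

```python
def unflatten(flat):
    """Hypothetical helper: rebuild a nested tree from flat keys."""
    root = {}
    for key, value in flat.items():
        # Drop an optional leading "/" and split the path into components.
        parts = key.strip("/").split("/")
        node = root
        # Walk (and create as needed) one dict per group or array on the path.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        # The last component is the metadata object name, e.g. ".zarray".
        node[parts[-1]] = value
    return root


# Placeholder contents for illustration.
flat = {
    ".zgroup": {"zarr_format": 2},
    "var1/.zarray": {"shape": [10]},
    "subgroup1/.zgroup": {"zarr_format": 2},
    "subgroup1/var2/.zarray": {"shape": [5]},
}

tree = unflatten(flat)
print(sorted(tree))
```

Because intermediate nodes are created on demand, the result does not depend on the order in which the flat keys are visited.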

@shoyer

shoyer commented May 9, 2021

> BTW, should there be a requirement that the order of flat keys be sorted?

JSON objects are unordered, so no, there should be no expectations about sorting.
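
A quick way to see this in practice, using Python's standard `json` module (the document contents are placeholders): two texts that differ only in key order parse to equal objects. A writer that wants byte-for-byte reproducible output can still choose to sort keys when serializing, but a reader should not rely on it.

```python
import json

a = json.loads('{"zarr_consolidated_format": 1, "metadata": {".zgroup": {"zarr_format": 2}}}')
b = json.loads('{"metadata": {".zgroup": {"zarr_format": 2}}, "zarr_consolidated_format": 1}')

# Python dict equality ignores insertion order, mirroring JSON object semantics.
same_content = a == b
print(same_content)

# Optional: sorting keys on output gives a deterministic serialization.
same_text = json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
print(same_text)
```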

@joshmoore
Member

@DennisHeimbigner: are you thinking about the structure for V2, V3 or both?

@DennisHeimbigner
Author

I was just looking at the existing V2 .zmetadata. Has the issue been raised for V3 yet?

@DennisHeimbigner
Author

I see from this comment:
#41 (comment)
that there appears to be another .zmetadata encoding in use.

@shoyer

shoyer commented May 13, 2021

> I see from this comment:
> #41 (comment)
> that there appears to be another .zmetadata encoding in use.

Can you clarify what you mean here? I think this is the same .zmetadata encoding, just with keys printed in a different order (but JSON is not order sensitive).

@DennisHeimbigner
Author

The header info


    'zarr_consolidated_format': 1,
    'metadata': {

is added. So it needs standardization also.

@shoyer

shoyer commented May 13, 2021

> The header info
>
>     'zarr_consolidated_format': 1,
>     'metadata': {
>
> is added. So it needs standardization also.

Right, that was also in my example above: #113 (comment)

@rabernat
Contributor

Just pinging this discussion based on today's zarr call: we need to sort out how to handle consolidated metadata for V3. Presumably it will be cheaper to list metadata in V3 because of the separation of metadata and data in the tree. But we still need an answer for "unlistable stores". Perhaps this is covered in the spec, but I did not see it.

@jstriebel
Member

Consolidated metadata for v3 is being discussed in #136. Marking this issue for the v2 discussion.
