v2: Standardizing .zmetadata #113

DennisHeimbigner · 2021-05-08T19:45:32Z

I want to begin a discussion about standardizing
the .zmetadata format for consolidated metadata.

Suppose we have this Zarr container.

.zgroup -- of the root group
var1
    .zarray -- for var1
subgroup1
    .zgroup
    var2
        .zarray -- for var2
        .zattrs  -- for var2

This structure needs to be encoded as JSON in the .zmetadata object.
I can see two obvious encodings:

nested encoding

{
".zgroup": {<contents of the .zgroup>},
"var1": {
    ".zarray": {<contents of .zarray>}
    },
"subgroup1": {
    ".zgroup": {<contents of the .zgroup>},
    "var2": {
        ".zarray": {<contents of .zarray>},
        ".zattrs": {<contents of .zattrs>}
        }
    }
}

flat-key encoding

{
"/.zgroup": {<contents of the .zgroup>},
"/var1/.zarray": {<contents of .zarray>},
"/subgroup1/.zgroup": {<contents of the .zgroup>},
"/subgroup1/var2/.zarray": {<contents of .zarray>},
"/subgroup1/var2/.zattrs": {<contents of .zattrs>}
}

My observations:

The flat-key encoding should, as a rule, be slightly smaller than the nested encoding.
The nested encoding would be easier to process into internal data structures, but that depends on the implementation. It would be faster for netcdf-c, but might not be for zarr-python.
Note that I have prefixed each key with "/", but that is just my choice; a decision is needed about that.
The one example I have seen in the wild uses flat-key encoding.
The flat-key encoding has no entries for non-content bearing objects. So, for example, there is no "/subgroup1" key nor a "/subgroup1/var2" key. This seems reasonable since it would not add any useful information.
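
To make the comparison concrete, here is a minimal sketch of converting between the two encodings. The function names `flat_to_nested` and `nested_to_flat` are hypothetical, and the metadata values are invented placeholders rather than real `.zarray` contents. It assumes that no array or group is itself named `.zgroup`, `.zarray`, or `.zattrs`.

```python
import json

# Names of the metadata objects used by Zarr v2.
METADATA_NAMES = (".zgroup", ".zarray", ".zattrs")


def flat_to_nested(flat):
    """Convert {"/a/b/.zarray": {...}} style keys into nested dicts."""
    nested = {}
    for key, value in flat.items():
        # Drop the optional leading "/" and split into path segments.
        parts = [p for p in key.split("/") if p]
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


def nested_to_flat(nested, prefix=""):
    """Inverse of flat_to_nested, emitting keys with a leading "/"."""
    flat = {}
    for name, value in nested.items():
        path = f"{prefix}/{name}"
        if name in METADATA_NAMES:
            flat[path] = value  # a metadata document: stop descending
        else:
            flat.update(nested_to_flat(value, path))
    return flat


# Placeholder contents, for illustration only.
flat = {
    "/.zgroup": {"zarr_format": 2},
    "/var1/.zarray": {"shape": [10]},
    "/subgroup1/.zgroup": {"zarr_format": 2},
    "/subgroup1/var2/.zarray": {"shape": [5]},
    "/subgroup1/var2/.zattrs": {"units": "m"},
}
nested = flat_to_nested(flat)
print(json.dumps(nested, indent=2))
```

The round trip is lossless for this container, which suggests the choice between the two is mostly about convenience for each implementation.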


shoyer · 2021-05-09T20:21:42Z

The current .zmetadata format written by Zarr-Python uses the "flat-key encoding" without a prefix, e.g., looking at one of my recent datasets:

{
    "metadata": {
        ".zattrs": ...
        ".zgroup": ...
        "lat/.zarray": ...
        "lat/.zattrs": ...
        "level/.zarray": ...
        "level/.zattrs": ...
        ...
        "z/.zarray": ...
        "z/.zattrs": ...
    },
    "zarr_consolidated_format": 1
}

I suspect this format was chosen in part because it's slightly more natural in Zarr-Python to look up flat-keys rather than nested keys. But I'm sure performance would be fine with nested keys, too. Inherently both seem fine to me.
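
As a rough illustration of why flat keys are convenient to look up, here is a sketch that uses only the standard library, not the Zarr-Python API. The helper name `lookup` is hypothetical and the metadata values are invented placeholders; only the envelope shape follows the example above.

```python
import json

# A tiny document in the shape shown above; the values are placeholders.
doc = json.loads("""
{
  "metadata": {
    ".zgroup": {"zarr_format": 2},
    "lat/.zarray": {"shape": [721]},
    "lat/.zattrs": {"units": "degrees_north"}
  },
  "zarr_consolidated_format": 1
}
""")


def lookup(doc, path, name):
    """Return the metadata document `name` for the node at `path`, or None."""
    path = path.strip("/")
    # The root node has no path prefix, so its key is just the name.
    key = f"{path}/{name}" if path else name
    return doc["metadata"].get(key)


print(lookup(doc, "lat", ".zarray"))
print(lookup(doc, "", ".zgroup"))
```

Each lookup is a single dictionary access, whereas a nested encoding would need one access per path segment.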

I trust that performance for parsing either structure in netCDF-C would probably be OK, too? At least for "reasonable" size consolidated metadata? But even if performance would be similar, if it's significantly harder to work with non-nested metadata on some platforms that is definitely worth considering.

DennisHeimbigner · 2021-05-09T22:42:19Z

The reason that a nested encoding is slightly preferable for netcdf is that we track groups as independent objects, so we have to parse flat keys to get the group info out of them.
BTW, should there be a requirement that the order of flat keys be sorted?
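
The extra parsing step for flat keys might look roughly like the following. This is a sketch in Python, not netcdf-c code, and the helper names `split_key` and `collect_groups` are hypothetical. It assumes every array's parent is a group, even when that group has no `.zgroup` key of its own.

```python
def split_key(key):
    """Split "/a/b/.zarray" into ("a/b", ".zarray"); the root path is ""."""
    path, _, name = key.strip("/").rpartition("/")
    return path, name


def collect_groups(flat_keys):
    """Map each group path to the sorted names of the arrays directly in it."""
    groups = {}
    for key in flat_keys:
        path, name = split_key(key)
        if name == ".zgroup":
            groups.setdefault(path, set())
        elif name == ".zarray":
            # The array's parent is a group even if it has no .zgroup key.
            parent, _, array = path.rpartition("/")
            groups.setdefault(parent, set()).add(array)
    return {group: sorted(arrays) for group, arrays in groups.items()}


keys = [
    "/.zgroup",
    "/var1/.zarray",
    "/subgroup1/.zgroup",
    "/subgroup1/var2/.zarray",
    "/subgroup1/var2/.zattrs",
]
groups = collect_groups(keys)
print(groups)
```

With a nested encoding, the same group objects could be built by walking the JSON tree directly, without splitting keys.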

shoyer · 2021-05-09T23:43:45Z

> BTW, should there be a requirement that the order of flat keys be sorted?

JSON objects are unordered, so no, there should be no expectations about sorting.
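
A small sketch of what this means in practice: two documents with the same keys in a different order compare equal once parsed. A writer that wants byte-for-byte reproducible output can still sort keys when serializing, without readers depending on it. The keys and values here are placeholders.

```python
import json

# The same two entries, written in a different order.
a = json.loads('{"lat/.zarray": {"shape": [721]}, ".zgroup": {"zarr_format": 2}}')
b = json.loads('{".zgroup": {"zarr_format": 2}, "lat/.zarray": {"shape": [721]}}')

# Python dict equality ignores insertion order.
same = a == b

# Optional: a deterministic serialization, useful for diffs or checksums.
canonical_a = json.dumps(a, sort_keys=True)
canonical_b = json.dumps(b, sort_keys=True)

print(same, canonical_a == canonical_b)
```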

joshmoore · 2021-05-13T08:58:57Z

@DennisHeimbigner: are you thinking about the structure for V2, V3 or both?

DennisHeimbigner · 2021-05-13T18:19:51Z

I was just looking at the existing V2 .zmetadata. Has the issue been raised for V3 yet?

DennisHeimbigner · 2021-05-13T18:24:56Z

I see from this comment:
#41 (comment)
that there appears to be another .zmetadata encoding in use.

shoyer · 2021-05-13T18:48:51Z

> I see from this comment:
> #41 (comment)
> that there appears to be another .zmetadata encoding in use.

Can you clarify what you mean here? I think this is the same .zmetadata encoding, just with keys printed in a different order (but JSON is not order sensitive).

DennisHeimbigner · 2021-05-13T18:55:44Z

The header info


    'zarr_consolidated_format': 1,
    'metadata': {

is added. So it needs standardization also.
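
For reference, a reader that handles this envelope might do something like the sketch below. The helper names `wrap` and `unwrap` are hypothetical. Rejecting any version other than 1 is an assumption made for illustration; the thread only shows the value 1 and does not say how other values should be treated.

```python
def wrap(metadata):
    """Put flat-key metadata inside the envelope shown above."""
    return {"zarr_consolidated_format": 1, "metadata": metadata}


def unwrap(doc):
    """Return the flat-key metadata, checking the envelope first."""
    version = doc.get("zarr_consolidated_format")
    if version != 1:
        # Assumption: unknown versions are an error, not silently accepted.
        raise ValueError(f"unsupported zarr_consolidated_format: {version!r}")
    return doc["metadata"]


envelope = wrap({".zgroup": {"zarr_format": 2}})
print(unwrap(envelope))
```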

shoyer · 2021-05-13T19:04:40Z

> The header info
>     'zarr_consolidated_format': 1,
>     'metadata': {
> is added. So it needs standardization also.

Right, that was also in my example above: #113 (comment)

rabernat · 2021-08-25T19:04:41Z

Just pinging this discussion based on today's zarr call: we need to sort out how to handle consolidated metadata for V3. Presumably it will be cheaper to list metadata in V3 because of the separation of metadata and data in the tree. But we still need an answer for "unlistable stores". Perhaps this is covered in the spec, but I did not see it.

jstriebel · 2022-11-16T16:36:17Z

Consolidated metadata for v3 is being discussed in #136. Marking this issue for the v2 discussion.

joshmoore mentioned this issue Sep 23, 2021

Outreachy project proposals (Oct. 2021) zarr-developers/community#39

Closed

joshmoore mentioned this issue Jan 26, 2022

Community feedback process (e.g. ZEP) zarr-developers/governance#14

Closed

joshmoore mentioned this issue Apr 20, 2022

misc Zarr V3 bug fixes: open_group, open_consolidated and MemoryStoreV3 zarr-developers/zarr-python#1006

Merged

3 tasks

jstriebel added the v2 label Nov 16, 2022

joshmoore changed the title ~~Standardizing .zmetadata~~ v2: Standardizing .zmetadata Nov 17, 2022

joshmoore mentioned this issue Nov 17, 2022

consolidate_metadata for tables kevinyamauchi/ome-ngff-tables-prototype#12

Merged

d-v-b mentioned this issue Jul 11, 2023

don't use a field called items janelia-cellmap/pydantic-zarr#6

Closed
