Native metadata storage #90

Closed
alimanfoo opened this issue Aug 27, 2020 · 8 comments · Fixed by #203
Labels: core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec)

Comments

@alimanfoo
Member

Currently the v3 core protocol spec assumes that array and group metadata documents will be encoded in some way (default JSON) prior to storage. This allows use of stores like file systems or cloud object stores, because they don't need to be able to do anything other than store and retrieve a sequence of bytes.

However, some stores might be able to store metadata natively, i.e., without encoding. For example, a MongoDB store would be able to store the metadata documents natively. Similarly, an SQLite store might choose to store metadata natively, making use of the JSON1 extension.
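
For illustration, here is a minimal sketch of what native metadata storage could look like with SQLite's JSON1 functions (assuming an SQLite build with JSON1 available; the table and key names here are hypothetical):

import json
import sqlite3

con = sqlite3.connect("example.zr3.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, doc TEXT)")

# json() validates the document, so SQLite can treat the column as JSON
meta = {"attributes": {"answer": 42}}
con.execute(
    "INSERT OR REPLACE INTO metadata VALUES (?, json(?))",
    ("meta/root/foo.group", json.dumps(meta)),
)

# query inside the stored document without decoding it client-side
row = con.execute(
    "SELECT json_extract(doc, '$.attributes.answer') FROM metadata WHERE key = ?",
    ("meta/root/foo.group",),
).fetchone()
print(row[0])  # -> 42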

There are several potential advantages to using these native storage capabilities for metadata. For example, both MongoDB and SQLite could manage concurrent modification to the user attributes on a zarr array or group. These storage engines could also support efficient queries across all metadata in a zarr hierarchy.

How should we deal with this in the v3 core protocol spec? I.e., how do we accommodate stores with native support for metadata storage, as well as stores that require metadata to be encoded?

@alimanfoo added the core-protocol-v3.0 label on Aug 27, 2020
@alimanfoo
Member Author

Hi @joshmoore, just tagging you in here to say I thought I'd raise this issue as a place to continue some of the conversations we had on the community call yesterday.

@alimanfoo
Member Author

alimanfoo commented Aug 27, 2020

Regarding a potential MongoDB store, here are a few interesting points.

There is a MongoDB store currently implemented in zarr-python, developed by @jhamman and @nbren12. It's nice and simple and works well AFAIK. However, it doesn't have any awareness of what it's storing. I.e., it treats chunks and metadata documents all the same. For example, here's the implementation of __setitem__:

# From zarr-python's MongoDBStore; ensure_bytes comes from numcodecs.compat.
def __setitem__(self, key, value):
    value = ensure_bytes(value)
    # upsert a single {key, value} document, overwriting any existing entry
    self.collection.replace_one({"key": key},
                                {"key": key, "value": value},
                                upsert=True)

In other words, this implementation stores everything in MongoDB as a single collection of documents, where each document is like {"key": "foo", "value": "some_bytes"}. The value could be an encoded metadata document, or an encoded chunk, the store doesn't care.
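
The matching read path is just as agnostic; roughly (a paraphrase, not a verbatim quote from zarr-python):

def __getitem__(self, key):
    doc = self.collection.find_one({"key": key})
    if doc is None:
        raise KeyError(key)
    return doc["value"]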

It would be possible to consider a slightly different implementation, where the store was aware of whether it was storing metadata or chunk data, and used different storage strategies for each.

For example, it could store metadata and chunk data in separate mongo collections. It could also store metadata natively. (And it could use gridfs for storing chunk data, but that's a tangential point.)
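
A sketch of what that routing could look like (a hypothetical class, not something that exists in zarr-python; the "meta/" prefix follows the v3 layout shown below):

import json

class MetadataAwareMongoStore:
    """Hypothetical store that treats metadata and chunk data differently."""

    def __init__(self, db):
        self.metadata = db.metadata  # metadata as native JSON documents
        self.chunks = db.chunks      # chunk data as opaque bytes

    def __setitem__(self, key, value):
        if key == "zarr.json" or key.startswith("meta/"):
            doc = json.loads(value)  # decode and store the document natively
            doc["_id"] = key         # the storage key doubles as the _id
            self.metadata.replace_one({"_id": key}, doc, upsert=True)
        else:
            self.chunks.replace_one(
                {"key": key}, {"key": key, "value": value}, upsert=True
            )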

When storing metadata, the natural thing to do would be to have a collection of metadata documents, and use the storage key as the value of the "_id" field. E.g., if I had a zarr v3 hierarchy with the following metadata on the file system:

$ tree test.zr3/meta
test.zr3/meta
└── root
    ├── arthur
    │   └── dent.array
    └── tricia
        └── mcmillan.group

...then in mongo I could store two metadata documents in a metadata collection, like:

[
{
    "_id": "meta/root/arthur/dent.array",
    "shape": [
        5,
        10
    ],
    "data_type": "<i4",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [
            2,
            5
        ]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": null,
    "extensions": [],
    "attributes": {
        "question": "life",
        "answer": 42
    }
},
{
    "_id": "meta/root/tricia/mcmillan.group",
    "attributes": {
        "heart": "gold",
        "improbability": "infinite",
        "friends": ["Arthur Dent"]
    }
}
]

To retrieve a metadata document using pymongo, I could then do:

db.metadata_collection.find_one({"_id": "meta/root/tricia/mcmillan.group"})

To modify a field within a metadata document, I could do something like:

db.metadata_collection.update_one(
    {"_id": "meta/root/tricia/mcmillan.group"},
    {"$push": {"attributes.friends": "Zaphod Beeblebrox"}},
    upsert=True
)

To query across all metadata documents, I could do something like:

db.metadata_collection.find({"attributes.friends": "Arthur Dent"})

These advanced metadata update and query operations are not something you would necessarily want to expose as part of the standard Store API, but they are functionality you could expose on a MongoDB store, or access via pymongo directly.

This is just meant to illustrate some possibilities. It is also meant to illustrate why I don't think the name clash problem that @joshmoore raised would occur: here you are storing a collection of metadata documents, one for each node in the hierarchy, indexed on the storage key.

@Carreau
Contributor

Carreau commented Aug 27, 2020

I don't think there is much in the protocol that needs to change, though implementations might want to be extra careful about what API they expose, to make sure concurrent mutations are possible and non-conflicting.

I think, as we discussed yesterday, that this is one of the differences between a Storage Backend and an Access Backend.

@rabernat
Contributor

I strongly support the idea that encoding a native dict into JSON (or another encoded format) should be optional for a store, if the store has a built-in understanding of key-value documents. DynamoDB is another example.

What is needed is a mechanism for a store to declare whether it wants metadata documents to be encoded as bytes or not.
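
One possible shape for such a mechanism, as a sketch (the flag name and classes here are hypothetical, just to make the idea concrete):

import json

class EncodedStore(dict):
    # today's behaviour: the store only ever sees bytes
    supports_native_metadata = False

class NativeStore(dict):
    # hypothetical: the store accepts decoded documents directly
    supports_native_metadata = True

def save_metadata(store, key, meta):
    if store.supports_native_metadata:
        store[key] = meta  # pass the document through untouched
    else:
        store[key] = json.dumps(meta).encode("utf-8")  # default encode path

A store backed by MongoDB or DynamoDB would declare the flag and receive the decoded document; every other store keeps the current bytes-only contract.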

@jbms
Contributor

jbms commented Nov 16, 2022

I think this could be handled as an implementation detail without a need for anything in the spec, since currently the spec does not describe any specific stores. However, if there is a desire to describe in the spec certain store formats, such as SQLite, then the spec for that store could also state that metadata documents should be stored in a certain way.

@jstriebel
Member

The spec for that store could also state that metadata documents should be stored in a certain way.

+1 from my side, see the current store specs: https://zarr-specs.readthedocs.io/en/latest/stores.html

@joshmoore
Member

joshmoore commented Nov 17, 2022

then the spec for that store could also state that metadata documents should be stored in a certain way.

I'd almost say SHOULD if not MUST. And that would be an extension, which implementations can choose to implement or not. However, there's an interesting bootstrapping problem: if the name of the extension is in the metadata, then the implementation still needs to know how to load that metadata in order to know the specification of how the metadata will be laid out.

Likely related to #65

@rabernat
Contributor

The key issue here is the interface between the store layer and whatever sits above it in the stack: what sort of objects is a store allowed to send and receive? Up until now, it has generally been assumed (if not stated explicitly) that values in the store's key/value system are always raw bytes, and the encoding / decoding is handled by the higher layer.

If we are not sending raw bytes to the store, then what are we sending? Abandoning the idea that stores always just deal in raw bytes creates a potentially leaky abstraction. Python dicts are a language-specific thing (although most languages have an equivalent data structure). Can we frame this in a way that is not specific to Python?
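
One way to frame it language-neutrally (a sketch with hypothetical names): define the store value as either raw bytes or a value in the JSON data model, which every language can represent:

from abc import ABC, abstractmethod
from typing import Union

# "Document" means a value in the JSON data model (object, array, string,
# number, boolean, null); dict and list are just Python's spelling of it.
Document = Union[dict, list, str, int, float, bool, None]

class Store(ABC):
    @abstractmethod
    def set(self, key: str, value: Union[bytes, Document]) -> None:
        """Accept raw bytes (chunks) or a JSON-model document (metadata)."""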

There is also the edge case of zarr.json, which, according to the spec, must be JSON.
