Native metadata storage #90

Closed
alimanfoo opened this issue Aug 27, 2020 · 8 comments · Fixed by #203
Labels: core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec)

Comments

@alimanfoo
Member

Currently the v3 core protocol spec assumes that array and group metadata documents will be encoded in some way (default JSON) prior to storage. This allows use of stores like file systems or cloud object stores, because they don't need to be able to do anything other than store and retrieve a sequence of bytes.

However, some stores might be able to store metadata natively, i.e., without encoding. For example, a MongoDB store would be able to store the metadata documents natively. Similarly, an SQLite store might choose to store metadata natively, making use of the JSON1 extension.
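
For illustration, here is a minimal sketch of what native metadata storage could look like with SQLite's JSON1 functions (assuming an SQLite build with JSON1 available; the table and key names here are hypothetical):

import json
import sqlite3

con = sqlite3.connect("example.zr3.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, doc TEXT)")

# json() validates the document, so SQLite can treat the column as JSON
meta = {"attributes": {"answer": 42}}
con.execute(
    "INSERT OR REPLACE INTO metadata VALUES (?, json(?))",
    ("meta/root/foo.group", json.dumps(meta)),
)

# query inside the stored document without decoding it client-side
row = con.execute(
    "SELECT json_extract(doc, '$.attributes.answer') FROM metadata WHERE key = ?",
    ("meta/root/foo.group",),
).fetchone()
print(row[0])  # -> 42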

There are several potential advantages to using these native storage capabilities for metadata. For example, both MongoDB and SQLite could manage concurrent modification to the user attributes on a zarr array or group. These storage engines could also support efficient queries across all metadata in a zarr hierarchy.

How should we deal with this in the v3 core protocol spec? I.e., how do we accommodate stores with native support for metadata storage, as well as stores that require metadata to be encoded?

@alimanfoo added the core-protocol-v3.0 label on Aug 27, 2020
@alimanfoo
Member Author

Hi @joshmoore, just tagging you in here to say I thought I'd raise this issue as a place to continue some of the conversations we had on the community call yesterday.

@alimanfoo
Member Author

alimanfoo commented Aug 27, 2020

Regarding a potential MongoDB store, here are a few interesting points.

There is a MongoDB store currently implemented in zarr-python, developed by @jhamman and @nbren12. It's nice and simple and works well AFAIK. However, it doesn't have any awareness of what it's storing. I.e., it treats chunks and metadata documents all the same. For example, here's the implementation of __setitem__:

# From zarr-python's MongoDBStore; ensure_bytes comes from numcodecs.compat.
def __setitem__(self, key, value):
    value = ensure_bytes(value)
    # upsert a single {key, value} document, overwriting any existing entry
    self.collection.replace_one({"key": key},
                                {"key": key, "value": value},
                                upsert=True)

In other words, this implementation stores everything in MongoDB as a single collection of documents, where each document is like {"key": "foo", "value": "some_bytes"}. The value could be an encoded metadata document, or an encoded chunk, the store doesn't care.
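
The matching read path is just as agnostic; roughly (a paraphrase, not a verbatim quote from zarr-python):

def __getitem__(self, key):
    doc = self.collection.find_one({"key": key})
    if doc is None:
        raise KeyError(key)
    return doc["value"]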

It would be possible to consider a slightly different implementation, where the store was aware of whether it was storing metadata or chunk data, and used different storage strategies for each.

For example, it could store metadata and chunk data in separate mongo collections. It could also store metadata natively. (And it could use gridfs for storing chunk data, but that's a tangential point.)
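
A sketch of what that routing could look like (a hypothetical class, not something that exists in zarr-python; the "meta/" prefix follows the v3 layout shown below):

import json

class MetadataAwareMongoStore:
    """Hypothetical store that treats metadata and chunk data differently."""

    def __init__(self, db):
        self.metadata = db.metadata  # metadata as native JSON documents
        self.chunks = db.chunks      # chunk data as opaque bytes

    def __setitem__(self, key, value):
        if key == "zarr.json" or key.startswith("meta/"):
            doc = json.loads(value)  # decode and store the document natively
            doc["_id"] = key         # the storage key doubles as the _id
            self.metadata.replace_one({"_id": key}, doc, upsert=True)
        else:
            self.chunks.replace_one(
                {"key": key}, {"key": key, "value": value}, upsert=True
            )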

When storing metadata, the natural thing to do would be to have a collection of metadata documents, and use the storage key as the value of the "_id" field. E.g., if I had a zarr v3 hierarchy with the following metadata on the file system:

$ tree test.zr3/meta
test.zr3/meta
└── root
    ├── arthur
    │   └── dent.array
    └── tricia
        └── mcmillan.group

...then in mongo I could store two metadata documents in a metadata collection, like:

[
{
    "_id": "meta/root/arthur/dent.array",
    "shape": [
        5,
        10
    ],
    "data_type": "<i4",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [
            2,
            5
        ]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": null,
    "extensions": [],
    "attributes": {
        "question": "life",
        "answer": 42
    }
},
{
    "_id": "meta/root/tricia/mcmillan.group",
    "attributes": {
        "heart": "gold",
        "improbability": "infinite",
        "friends": ["Arthur Dent"]
    }
}
]

To retrieve a metadata document using pymongo, I could then do:

db.metadata_collection.find_one({"_id": "meta/root/tricia/mcmillan.group"})

To modify a field within a metadata document, I could do something like:

db.metadata_collection.update_one(
    {"_id": "meta/root/tricia/mcmillan.group"},
    {"$push": {"attributes.friends": "Zaphod Beeblebrox"}},
    upsert=True
)

To query across all metadata documents, I could do something like:

db.metadata_collection.find({"attributes.friends": "Arthur Dent"})

These advanced metadata update and query operations are not something you would necessarily want to expose as part of the standard Store API, but they are functionality you could expose on a MongoDB store, or access via pymongo directly.

This is just meant to illustrate some possibilities. It is also meant to illustrate why I don't think the name clash problem that @joshmoore raised would occur: here you are storing a collection of metadata documents, one for each node in the hierarchy, indexed on the storage key.

@Carreau
Contributor

Carreau commented Aug 27, 2020

I don't think there is much in the protocol that needs to change, though implementations might want to be extra careful about what API they expose, to make sure concurrent mutations are possible and non-conflicting.

I think, as we discussed yesterday, that this is one of the differences between a Storage Backend and an Access Backend.

@rabernat
Contributor

I strongly support the idea that encoding a native dict into JSON (or another encoded format) should be optional for a store, if the store has a built-in understanding of key-value documents. DynamoDB is another example.

What is needed is a mechanism for a store to declare whether it wants metadata documents to be encoded as bytes or not.
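
One possible shape for such a mechanism, as a sketch (the flag name and classes here are hypothetical, just to make the idea concrete):

import json

class EncodedStore(dict):
    # today's behaviour: the store only ever sees bytes
    supports_native_metadata = False

class NativeStore(dict):
    # hypothetical: the store accepts decoded documents directly
    supports_native_metadata = True

def save_metadata(store, key, meta):
    if store.supports_native_metadata:
        store[key] = meta  # pass the document through untouched
    else:
        store[key] = json.dumps(meta).encode("utf-8")  # default encode path

A store backed by MongoDB or DynamoDB would declare the flag and receive the decoded document; every other store keeps the current bytes-only contract.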

@jbms
Contributor

jbms commented Nov 16, 2022

I think this could be handled as an implementation detail without a need for anything in the spec, since currently the spec does not describe any specific stores. However, if there is a desire to describe in the spec certain store formats, such as SQLite, then the spec for that store could also state that metadata documents should be stored in a certain way.

@jstriebel
Member

The spec for that store could also state that metadata documents should be stored in a certain way.

+1 from my side, see the current store specs: https://zarr-specs.readthedocs.io/en/latest/stores.html

@joshmoore
Member

joshmoore commented Nov 17, 2022

then the spec for that store could also state that metadata documents should be stored in a certain way.

I'd almost say SHOULD if not MUST. And that would be an extension, which implementations can choose to implement or not. However, there's an interesting bootstrapping problem: if the name of the extension is in the metadata, then the implementation still needs to know how to load that metadata in order to know the specification of how the metadata will be laid out.

Likely related to #65

@rabernat
Contributor

The key issue here is the interface between the store layer and whatever sits above it in the stack: what sort of objects is a store allowed to send and receive? Up until now, it has generally been assumed (if not stated explicitly) that values in the store's key/value system are always raw bytes, and the encoding / decoding is handled by the higher layer.

If we are not sending raw bytes to the store, then what are we sending? Abandoning the idea that stores always just deal in raw bytes creates a potentially leaky abstraction. Python dicts are a language-specific thing (although most languages have an equivalent data structure). Can we frame this in a way that is not specific to Python?
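
One way to frame it language-neutrally (a sketch with hypothetical names): define the store value as either raw bytes or a value in the JSON data model, which every language can represent:

from abc import ABC, abstractmethod
from typing import Union

# "Document" means a value in the JSON data model (object, array, string,
# number, boolean, null); dict and list are just Python's spelling of it.
Document = Union[dict, list, str, int, float, bool, None]

class Store(ABC):
    @abstractmethod
    def set(self, key: str, value: Union[bytes, Document]) -> None:
        """Accept raw bytes (chunks) or a JSON-model document (metadata)."""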

There is also the edge case of zarr.json, which, according to the spec, must be JSON.
