
Add MongoDB storage backend? #299

Closed
nbren12 opened this issue Sep 19, 2018 · 20 comments · Fixed by #372

Comments


nbren12 commented Sep 19, 2018

I have been using MongoDB lately for logging the results of my computational experiments, and I have found that it is a pretty fun piece of software to use. Since MongoDB is designed to store large amounts of data across a distributed network, it seems like it could be a good match for zarr. I went ahead and made a very simple MutableMapping wrapper to MongoDB: https://gist.github.com/nbren12/9842cf22f173a864a7c8377c01ad06c5, and I am wondering if this is something that might be worth cleaning up and adding to zarr.
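The gist's approach can be sketched roughly as follows. This is a minimal sketch, not the gist's exact code; the class name, the `key`/`value` document fields, and the constructor signature are illustrative, assuming pymongo's `Collection` API:

```python
from collections.abc import MutableMapping


class MongoDBStore(MutableMapping):
    """Store zarr chunks as documents in a pymongo collection.

    Each chunk is one document: {'key': <store key>, 'value': <raw bytes>}.
    """

    def __init__(self, collection):
        # `collection` is an existing pymongo Collection object
        self.collection = collection

    def __getitem__(self, key):
        doc = self.collection.find_one({'key': key})
        if doc is None:
            raise KeyError(key)
        return doc['value']

    def __setitem__(self, key, value):
        # upsert=True so assignment both creates and overwrites
        self.collection.replace_one(
            {'key': key}, {'key': key, 'value': value}, upsert=True)

    def __delitem__(self, key):
        if self.collection.delete_one({'key': key}).deleted_count == 0:
            raise KeyError(key)

    def __iter__(self):
        for doc in self.collection.find({}, {'key': 1}):
            yield doc['key']

    def __len__(self):
        return self.collection.count_documents({})
```

Since pymongo maps Python `bytes` to BSON binary automatically, compressed chunks can be stored as-is.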

It would be interesting to benchmark its performance compared to the S3 and GCS backends. FWIW, a package which seems to use MongoDB in a similar way to store large binary datasets is https://github.com/manahl/arctic.


nbren12 commented Sep 19, 2018

@jhamman and @rabernat might be interested in this

@martindurant

There are actually a number of similar key/value stores that could be used in this context, such as Redis. Seems to me like a reasonable thing to try.

@rabernat

Sounds very cool. I'm not surprised that it works well. The whole point of these services is to serve up arbitrary chunks of data quickly.

However, the key question for cloud-based data storage is whether these benefits outweigh the cons. I see two main cons:

  • expense: if you are storing directly in cloud storage, you only pay for the bytes you store. With a database, you have to store the bytes, and you also have to pay for the compute instances that run the service
  • scalability: how do these services perform when you hit them with hundreds or thousands of simultaneous requests? At some point, you need to add load balancing / auto-scaling, which will inflate the cost. With object storage, there is no "service" between the user and the data, other than the cloud provider's storage service itself, which is highly scalable.

Another example of a data storage service that zarr could plug into is HDFS.

@martindurant

hdfs3 already has a mapping that can be used: https://github.com/dask/hdfs3/blob/master/hdfs3/mapping.py

I've not been able to iron out the various problems with hdfs3, and have been telling people to use pyarrow's HDFS interface instead, but it would be easy to make the same mapping for that (especially if reusing the code in fsspec!).
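That generic mapping can be exercised without any running cluster by pointing it at fsspec's in-memory filesystem (assumes a recent fsspec that provides `get_mapper`; the path is arbitrary):

```python
import fsspec

# get_mapper turns any fsspec-supported filesystem (HDFS, S3, GCS, local,
# in-memory, ...) into a MutableMapping that zarr can use as a store.
store = fsspec.get_mapper("memory://zarr-demo")
store["group/.zgroup"] = b'{"zarr_format": 2}'
assert b"zarr_format" in store["group/.zgroup"]
```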

@jakirkham

That sounds like a reasonable add to me. Would be good to hear @alimanfoo's thoughts on this.

FWIW I would separate the value of MongoDB from whether or not it lives on the cloud. For instance this could be a nice option for an in-house cluster.

Certainly HDFS support is useful to have as well.

@martindurant

By the way here is the generic code I have in fsspec, which can work with any file-system that meets the spec.


nbren12 commented Sep 20, 2018

A Redis backend would be even simpler to write than the MongoDB backend, and would be super cool for in-memory analysis.

I agree with Ryan's and Martin's points about the usefulness of S3/GCS vs MongoDB. It seems like MongoDB hosting in the cloud (https://mlab.com/plans/pricing/) is about 10x as expensive as S3, but MongoDB can be run on a local cluster. I don't quite understand why S3 is so cheap, since ultimately the data is still being served by an active computer somehow.


alimanfoo commented Sep 20, 2018 via email


nbren12 commented Sep 20, 2018

Ok. I would be happy to put this in a PR. Are there any automated tests for the backends?

I would probably add a Redis backend to the PR as well.


alimanfoo commented Sep 20, 2018 via email


alimanfoo commented Sep 20, 2018 via email

@jakirkham

Travis CI has a way to set up services pretty easily. Would look into that and see what they have for MongoDB and Redis. It should be as simple as listing them in the services section of the YAML file.
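For reference, the relevant fragment might look like this (service names as Travis CI documents them; the rest of the file unchanged):

```yaml
# .travis.yml (fragment)
services:
  - mongodb
  - redis-server
```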


nbren12 commented Sep 21, 2018 via email


nbren12 commented Sep 21, 2018

I added a Redis backend to the gist I linked above. I am loving how easy it is to write MutableMapping interfaces.

My approach has been to create a redis or pymongo client object, which I then pass to the mapping backend. The downside of this approach is that it requires some boilerplate code. On the other hand, it is nice to have access to the client object. For the PR, should I instead initialize the client object from the __init__ method of the MutableMapping interface?
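The client-object approach could be sketched like this (illustrative names, assuming redis-py's `get`/`set`/`delete`/`scan_iter` API; a `prefix` keeps zarr keys from colliding with other data in the same Redis database):

```python
from collections.abc import MutableMapping


class RedisStore(MutableMapping):
    """Store zarr chunks as Redis string values under prefixed keys."""

    def __init__(self, client, prefix='zarr'):
        # `client` is an existing redis-py client, e.g. redis.Redis()
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        return f'{self.prefix}:{key}'

    def __getitem__(self, key):
        value = self.client.get(self._key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.client.set(self._key(key), value)

    def __delitem__(self, key):
        # redis-py's delete returns the number of keys removed
        if not self.client.delete(self._key(key)):
            raise KeyError(key)

    def __iter__(self):
        offset = len(self.prefix) + 1
        # scan_iter yields keys as bytes
        for raw in self.client.scan_iter(match=f'{self.prefix}:*'):
            yield raw.decode()[offset:]

    def __len__(self):
        return sum(1 for _ in self)
```

Initializing the client inside `__init__` instead would just mean accepting host/port arguments there and constructing `redis.Redis(...)` internally; the mapping methods stay the same.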

@alimanfoo

@nbren12 re the client object I'd say it's up to you.


nbren12 commented Sep 21, 2018

Fair enough.

@jakirkham

@nbren12, am curious to hear how you are handling the document size limit with MongoDB when using Zarr.


nbren12 commented Dec 28, 2018 via email

@jakirkham

TBH just keeping things within the 16 MB limit may make sense for some workloads. For instance, an uncompressed single-precision 2048x2048 array would fit exactly in that limit. A larger array should be possible with compression. Of course, this assumes we want an array of that size in one chunk.

Anyway, I was mainly curious how this was working out for you in typical use cases. We should probably document this constraint for users less aware of what MongoDB is doing behind the scenes.
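For reference, the arithmetic behind that example (MongoDB's maximum BSON document size is 16 MiB, i.e. 16,777,216 bytes):

```python
MAX_BSON_BYTES = 16 * 1024 ** 2  # MongoDB's maximum BSON document size

# An uncompressed single-precision (float32, 4 bytes/element) 2048x2048 chunk:
chunk_bytes = 2048 * 2048 * 4

assert chunk_bytes == MAX_BSON_BYTES
```

In practice the document key and BSON framing add a few bytes of overhead, so a chunk of exactly this size would need at least slight compression to actually fit.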

@rabernat mentioned this issue Dec 28, 2018

nbren12 commented Jan 3, 2019 via email
