Add MongoDB storage backend? #299
There are actually a number of similar key/value stores that could be used in this context, such as Redis. Seems to me like a reasonable thing to try. |
Sounds very cool. I'm not surprised that it works well. The whole point of these services is to serve up arbitrary chunks of data quickly. However, the key question for cloud-based data storage is whether these benefits outweigh the cons. I see two main cons:
Another example of a data storage service that zarr could plug into is HDFS. |
hdfs3 already has a mapping that can be used: https://github.com/dask/hdfs3/blob/master/hdfs3/mapping.py I've not been able to iron out the various problems with hdfs3, and have been telling people to use pyarrow's instead, but it would be easy to make the same for that (especially if reusing the code in fsspec!). |
That sounds like a reasonable add to me. Would be good to hear @alimanfoo's thoughts on this. FWIW I would separate the value of MongoDB from whether or not it lives on the cloud. For instance this could be a nice option for an in-house cluster. Certainly HDFS support is useful to have as well. |
By the way here is the generic code I have in fsspec, which can work with any file-system that meets the spec. |
A Redis backend would be even simpler to write than the MongoDB backend, and would be super cool for in-memory analysis. I agree with Ryan's and Martin's points about the usefulness of S3/GCS vs MongoDB. It seems like MongoDB hosting in the cloud (https://mlab.com/plans/pricing/) is about 10x as expensive as S3, but MongoDB can be run on a local cluster. I don't quite understand why S3 is so cheap, since ultimately, the data is still being served by an active computer somehow. |
This sounds interesting. I think in general I would be happy to discuss new
storage backends contributed to Zarr. We're in a period of exploring
possibilities and so if it helps to support and focus that by bringing
backend implementations into Zarr (as we're doing for Azure Blob Storage)
or at least putting a proof of concept into a PR (as @rabernat did for GCS)
then that seems like a good thing to me.
If there is interest to explore a particular backend like mongo or redis,
I'd suggest to code up a working proof of concept, put it in a PR for
discussion, and ideally do some simple benchmarking to have something to
compare against other backends and so we can get a sense of whether it's
likely to be useful and in what context.
|
Ok. I would be happy to put this in a PR. Are there any automated tests for the backends? I would probably add a Redis backend to the PR as well. |
For tests take a look at zarr/tests/test_storage.py, there is a StoreTests
base class you can extend. For mongo or redis which usually run as an
external service, if you want the tests to run under Travis CI I don't know
what options are available but others may have suggestions. If the initial goal
is to share a proof of concept then I think it's ok if not all tests pass
and/or tests only run locally.
Some discussion over on the ABS store PR may be relevant
#293
|
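To give a feel for the pattern suggested above, here is a minimal stand-in sketch of a test-mixin arrangement like Zarr's `StoreTests` base class (the real class lives in `zarr/tests/test_storage.py` and covers many more cases; the names and checks below are illustrative only):

```python
import unittest


class StoreContractTests:
    """Generic checks any MutableMapping-style store should pass.

    A simplified stand-in for Zarr's StoreTests base class: subclasses
    provide create_store() and inherit the shared test methods.
    """

    def create_store(self):
        raise NotImplementedError

    def test_set_get_delete(self):
        store = self.create_store()
        store['foo'] = b'bar'
        assert store['foo'] == b'bar'
        del store['foo']
        assert 'foo' not in store


class TestDictStore(StoreContractTests, unittest.TestCase):
    """Sanity-check the pattern using a plain dict as the store."""

    def create_store(self):
        return {}
```

A hypothetical `TestMongoDBStore` or `TestRedisStore` would extend the base class the same way, returning a store connected to a test database from `create_store()`.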
Also discussion of GCS PR relevant
#252
|
Travis CI has a way to setup services pretty easily. Would look into that and see what they have for MongoDB and Redis. Should be as simple as listing them in that section of the YAML file. |
Yah. I remember setting up Redis with Travis for another project of mine.
Here is the yaml file I used:
https://github.com/nbren12/geostreams/blob/master/.travis.yml. Then, I made
a pytest fixture for opening the python client.
|
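For reference, the relevant `.travis.yml` section might look something like this (service names as documented by Travis CI, worth double-checking against their docs before relying on them):

```yaml
# Start MongoDB and Redis as background services for the build
services:
  - mongodb
  - redis-server
```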
I added a redis backend to the gist I linked above. I am loving how easy it is to write mutable mapping interfaces. My approach has been to create a redis/pymongo client object, which I then pass to the mapping backend. The downside of this approach is that it requires some boilerplate code. On the other hand, it is nice to have access to the client object. For the PR, should I instead initialize the client object from the |
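A minimal sketch of the approach described above: a `MutableMapping` backed by a Redis client that the caller constructs and passes in (the class name and key-prefix handling here are illustrative assumptions, not the gist's exact code):

```python
from collections.abc import MutableMapping


class RedisStore(MutableMapping):
    """Store chunks as Redis string values under a common key prefix."""

    def __init__(self, client, prefix='zarr:'):
        # client is constructed by the caller, e.g. redis.Redis(host='localhost')
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        return self.prefix + key

    def __getitem__(self, key):
        value = self.client.get(self._key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.client.set(self._key(key), value)

    def __delitem__(self, key):
        # redis DELETE returns the number of keys removed
        if not self.client.delete(self._key(key)):
            raise KeyError(key)

    def __iter__(self):
        # redis returns keys as bytes; strip the prefix back off
        for raw in self.client.keys(self.prefix + '*'):
            yield raw.decode()[len(self.prefix):]

    def __len__(self):
        return len(self.client.keys(self.prefix + '*'))
```

Because the client is injected rather than created internally, the same class works against a real server or a fake client in tests, at the cost of the boilerplate mentioned above.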
@nbren12 re the client object I'd say it's up to you. |
fair enough. |
@nbren12, am curious to hear how you are handling the document size limit with MongoDB when using Zarr. |
I don't ;). I just assume that each chunk is smaller than the 16 MB limit.
IIRC, MongoDB does have its own chunking strategy called GridFS for storing
larger data, but it is probably best to avoid "chunks of chunks".
|
TBH just keeping things within the 16MB limit may make sense for some workloads. For instance an uncompressed single precision 2048x2048 array would fit exactly in that limit. A larger array should be possible with compression. Of course this is assuming we want an array of that size in one chunk. Anyways, I was mainly curious how this was working out for you in typical use cases. We probably should document this constraint for users less aware of what MongoDB is doing behind the scenes. |
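The arithmetic here checks out: a single-precision (4 bytes/element) 2048x2048 chunk is exactly 16 MiB, which sits right at MongoDB's per-document limit, so in practice a chunk that size would need compression (or to be slightly smaller) once BSON overhead is accounted for:

```python
# Uncompressed float32 2048x2048 chunk vs MongoDB's ~16 MB document limit
chunk_bytes = 2048 * 2048 * 4  # 4 bytes per float32 element
print(chunk_bytes)             # 16777216 bytes
print(chunk_bytes / 2**20)     # 16.0 MiB
```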
Yes, I agree this should probably be documented as part of Joe's PR.
|
I have been using MongoDB lately for logging the results of my computational experiments, and I have found that it is a pretty fun piece of software to use. Since MongoDB is designed to store large amounts of data across a distributed network, it seems like it could be a good match for zarr. I went ahead and made a very simple MutableMapping wrapper to MongoDB: https://gist.github.com/nbren12/9842cf22f173a864a7c8377c01ad06c5, and I am wondering if this is something that might be worth cleaning up and adding to zarr. It would be interesting to benchmark its performance compared to S3 and GCS backends. FWIW, a package which seems to use MongoDB in a similar way to store large binary datasets is https://github.com/manahl/arctic.
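A hypothetical sketch of a MongoDB-backed `MutableMapping` along these lines (the `key`/`value` document layout is an assumption for illustration, not necessarily what the gist uses; the collection object is injected, e.g. `pymongo.MongoClient()['zarr_db']['chunks']` in real use):

```python
from collections.abc import MutableMapping


class MongoDBStore(MutableMapping):
    """Store each chunk as one document: {'key': <store key>, 'value': <bytes>}."""

    def __init__(self, collection):
        # e.g. pymongo.MongoClient()['zarr_db']['chunks']
        self.collection = collection

    def __getitem__(self, key):
        doc = self.collection.find_one({'key': key})
        if doc is None:
            raise KeyError(key)
        return doc['value']

    def __setitem__(self, key, value):
        # upsert: replace the document if present, insert otherwise
        self.collection.replace_one(
            {'key': key}, {'key': key, 'value': value}, upsert=True)

    def __delitem__(self, key):
        result = self.collection.delete_one({'key': key})
        if result.deleted_count == 0:
            raise KeyError(key)

    def __iter__(self):
        for doc in self.collection.find({}, {'key': 1}):
            yield doc['key']

    def __len__(self):
        return self.collection.count_documents({})
```

Note each `value` is limited by MongoDB's ~16 MB document size cap, so chunk shapes need to be chosen accordingly.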