
Add MongoDB storage backend? #299

Closed
nbren12 opened this issue Sep 19, 2018 · 20 comments · Fixed by #372

Comments


nbren12 commented Sep 19, 2018

I have been using MongoDB lately for logging the results of my computational experiments, and I have found that it is a pretty fun piece of software to use. Since MongoDB is designed to store large amounts of data across a distributed network, it seems like it could be a good match for zarr. I went ahead and made a very simple MutableMapping wrapper to MongoDB: https://gist.github.com/nbren12/9842cf22f173a864a7c8377c01ad06c5, and I am wondering if this is something that might be worth cleaning up and adding to zarr.
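The gist's approach can be sketched roughly as follows. This is a minimal sketch, not the gist's exact code; the class name, the `key`/`value` document fields, and the constructor signature are illustrative, assuming pymongo's `Collection` API:

```python
from collections.abc import MutableMapping


class MongoDBStore(MutableMapping):
    """Store zarr chunks as documents in a pymongo collection.

    Each chunk is one document: {'key': <store key>, 'value': <raw bytes>}.
    """

    def __init__(self, collection):
        # `collection` is an existing pymongo Collection object
        self.collection = collection

    def __getitem__(self, key):
        doc = self.collection.find_one({'key': key})
        if doc is None:
            raise KeyError(key)
        return doc['value']

    def __setitem__(self, key, value):
        # upsert=True so assignment both creates and overwrites
        self.collection.replace_one(
            {'key': key}, {'key': key, 'value': value}, upsert=True)

    def __delitem__(self, key):
        if self.collection.delete_one({'key': key}).deleted_count == 0:
            raise KeyError(key)

    def __iter__(self):
        for doc in self.collection.find({}, {'key': 1}):
            yield doc['key']

    def __len__(self):
        return self.collection.count_documents({})
```

Since pymongo maps Python `bytes` to BSON binary automatically, compressed chunks can be stored as-is.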

It would be interesting to benchmark its performance compared to the S3 and GCS backends. FWIW, a package which seems to use MongoDB in a similar way to store large binary datasets is https://github.com/manahl/arctic.


nbren12 commented Sep 19, 2018

@jhamman and @rabernat might be interested in this

@martindurant

There are actually a number of similar key/value stores that could be used in this context, such as Redis. Seems to me like a reasonable thing to try.

@rabernat

Sounds very cool. I'm not surprised that it works well. The whole point of these services is to serve up arbitrary chunks of data quickly.

However, the key question for cloud-based data storage is whether these benefits outweigh the cons. I see two main cons:

  • expense: if you are storing directly in cloud storage, you only pay for the bytes you store. With a database, you have to store the bytes, and you also have to pay for the compute instances that run the service
  • scalability: how do these services perform when you hit them with hundreds or thousands of simultaneous requests? At some point, you need to add load balancing / auto-scaling, which will inflate the cost. With object storage, there is no "service" between the user and the data, other than the cloud provider's storage service itself, which is highly scalable.

Another example of a data storage service that zarr could plug into is HDFS.

@martindurant

hdfs3 already has a mapping that can be used: https://github.com/dask/hdfs3/blob/master/hdfs3/mapping.py

I've not been able to iron out the various problems with hdfs3, and have been telling people to use pyarrow's HDFS interface instead, but it would be easy to make the same mapping for that (especially if reusing the code in fsspec!).
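That generic mapping can be exercised without any running cluster by pointing it at fsspec's in-memory filesystem (assumes a recent fsspec that provides `get_mapper`; the path is arbitrary):

```python
import fsspec

# get_mapper turns any fsspec-supported filesystem (HDFS, S3, GCS, local,
# in-memory, ...) into a MutableMapping that zarr can use as a store.
store = fsspec.get_mapper("memory://zarr-demo")
store["group/.zgroup"] = b'{"zarr_format": 2}'
assert b"zarr_format" in store["group/.zgroup"]
```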

@jakirkham

That sounds like a reasonable add to me. Would be good to hear @alimanfoo's thoughts on this.

FWIW I would separate the value of MongoDB from whether or not it lives on the cloud. For instance this could be a nice option for an in-house cluster.

Certainly HDFS support is useful to have as well.

@martindurant

By the way here is the generic code I have in fsspec, which can work with any file-system that meets the spec.


nbren12 commented Sep 20, 2018

A Redis backend would be even simpler to write than the MongoDB backend, and would be super cool for in-memory analysis.

I agree with Ryan's and Martin's points about the usefulness of S3/GCS vs MongoDB. It seems like MongoDB hosting in the cloud (https://mlab.com/plans/pricing/) is about 10x as expensive as S3, but MongoDB can be run on a local cluster. I don't quite understand why S3 is so cheap, since ultimately the data is still being served by an active computer somehow.


alimanfoo commented Sep 20, 2018 via email


nbren12 commented Sep 20, 2018

Ok. I would be happy to put this in a PR. Are there any automated tests for the backends?

I would probably add a Redis backend to the PR as well.


alimanfoo commented Sep 20, 2018 via email


alimanfoo commented Sep 20, 2018 via email

@jakirkham

Travis CI has a way to set up services pretty easily. Would look into that and see what they have for MongoDB and Redis. It should be as simple as listing them in the services section of the YAML file.
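For reference, the relevant fragment might look like this (service names as Travis CI documents them; the rest of the file unchanged):

```yaml
# .travis.yml (fragment)
services:
  - mongodb
  - redis-server
```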


nbren12 commented Sep 21, 2018 via email


nbren12 commented Sep 21, 2018

I added a Redis backend to the gist I linked above. I am loving how easy it is to write MutableMapping interfaces.

My approach has been to create a redis or pymongo client object, which I then pass to the mapping backend. The downside of this approach is that it requires some boilerplate code. On the other hand, it is nice to have access to the client object. For the PR, should I instead initialize the client object from the __init__ method of the MutableMapping interface?
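The client-object approach could be sketched like this (illustrative names, assuming redis-py's `get`/`set`/`delete`/`scan_iter` API; a `prefix` keeps zarr keys from colliding with other data in the same Redis database):

```python
from collections.abc import MutableMapping


class RedisStore(MutableMapping):
    """Store zarr chunks as Redis string values under prefixed keys."""

    def __init__(self, client, prefix='zarr'):
        # `client` is an existing redis-py client, e.g. redis.Redis()
        self.client = client
        self.prefix = prefix

    def _key(self, key):
        return f'{self.prefix}:{key}'

    def __getitem__(self, key):
        value = self.client.get(self._key(key))
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.client.set(self._key(key), value)

    def __delitem__(self, key):
        # redis-py's delete returns the number of keys removed
        if not self.client.delete(self._key(key)):
            raise KeyError(key)

    def __iter__(self):
        offset = len(self.prefix) + 1
        # scan_iter yields keys as bytes
        for raw in self.client.scan_iter(match=f'{self.prefix}:*'):
            yield raw.decode()[offset:]

    def __len__(self):
        return sum(1 for _ in self)
```

Initializing the client inside `__init__` instead would just mean accepting host/port arguments there and constructing `redis.Redis(...)` internally; the mapping methods stay the same.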

@alimanfoo

@nbren12 re the client object I'd say it's up to you.


nbren12 commented Sep 21, 2018

Fair enough.

@jakirkham

@nbren12, am curious to hear how you are handling the document size limit with MongoDB when using Zarr.


nbren12 commented Dec 28, 2018 via email

@jakirkham

TBH just keeping things within the 16 MB limit may make sense for some workloads. For instance, an uncompressed single-precision 2048x2048 array would fit exactly in that limit. A larger array should be possible with compression. Of course, this assumes we want an array of that size in one chunk.

Anyway, I was mainly curious how this was working out for you in typical use cases. We should probably document this constraint for users less aware of what MongoDB is doing behind the scenes.
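For reference, the arithmetic behind that example (MongoDB's maximum BSON document size is 16 MiB, i.e. 16,777,216 bytes):

```python
MAX_BSON_BYTES = 16 * 1024 ** 2  # MongoDB's maximum BSON document size

# An uncompressed single-precision (float32, 4 bytes/element) 2048x2048 chunk:
chunk_bytes = 2048 * 2048 * 4

assert chunk_bytes == MAX_BSON_BYTES
```

In practice the document key and BSON framing add a few bytes of overhead, so a chunk of exactly this size would need at least slight compression to actually fit.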

@rabernat mentioned this issue Dec 28, 2018

nbren12 commented Jan 3, 2019 via email
