
HTTP Store #373 (Closed)

Conversation

@rabernat (Contributor)

In keeping with the holiday theme of new stores (e.g. #299), I have created a bare-bones implementation of an HTTP store. This is something I have been thinking about for a long time: the simplest possible way to access zarr data over the internet. The idea is simply to issue HTTP GET requests for all the desired data. This store only makes sense for read-only access to data with consolidated metadata, since plain HTTP does not support directory listing. However, for public data, it drastically simplifies the process of accessing remote data, bypassing the need for external libraries such as s3fs, gcsfs, etc. It also opens the door to decentralized peer-to-peer sharing of zarr data: just fire up a web server in front of your consolidated DirectoryStore.
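
Such a store is essentially a thin MutableMapping wrapper around GET requests. A minimal sketch of the shape of the idea (this is illustrative, not the exact code in this PR; the requests dependency and class name are assumptions):

import requests
from collections.abc import MutableMapping

class SimpleHTTPStore(MutableMapping):
    """Read-only store: each key lookup becomes an HTTP GET (sketch)."""

    def __init__(self, url):
        self.url = url.rstrip('/')

    def __getitem__(self, key):
        response = requests.get(self.url + '/' + key)
        if response.status_code == 404:
            raise KeyError(key)  # zarr treats KeyError as a missing key/chunk
        response.raise_for_status()
        return response.content

    def __setitem__(self, key, value):
        raise NotImplementedError('read-only store')

    def __delitem__(self, key):
        raise NotImplementedError('read-only store')

    def __iter__(self):
        raise NotImplementedError('plain HTTP cannot list keys')

    def __len__(self):
        raise NotImplementedError('plain HTTP cannot list keys')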

I feel this is a promising path towards incorporating some sort of built-in remote dataset access within zarr. We have two long-pending PRs (#293 and #252) which implement custom classes for Azure Blob Storage and GCS. Given the overlap with @martindurant's s3fs, gcsfs, etc., it's not obvious that these sorts of stores are worth the effort of maintaining within zarr. The HTTP store is a middle path: if you just want read-only access to public data, zarr can provide that. Otherwise, you need the third-party libraries.

I'm not sure the best way to test this, since it is fundamentally a read-only store and there are no existing examples of that to copy. Suggestions welcome. However, the following code works:

import zarr
# an existing consolidated public dataset
url = 'https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/'
store = zarr.storage.HTTPStore(url)
# a fully functional read-only Group full of arrays
group = zarr.open_consolidated(store)

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@martindurant (Member)

Please also consider the HTTP implementation in fsspec, which comes with a get_mapper() method for getting the mutable mapping object that zarr needs. https://github.com/martindurant/filesystem_spec/blob/master/fsspec/implementations/http.py
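
For example (a sketch assuming a recent fsspec; the URL is the one from the example above):

import fsspec
import zarr

# HTTPFileSystem handles the GET requests; get_mapper returns the
# MutableMapping interface that zarr's store API expects
mapper = fsspec.get_mapper('https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/')
group = zarr.open_consolidated(mapper)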

@rabernat (Contributor, Author)

@martindurant - last I checked (a few months ago), fsspec had no tests and was not considered ready for production. I see that things have moved along quite a bit since. Perhaps my effort here, like with #252, is obsolete.

@martindurant (Member)

As with google or azure, there may well be a benefit to having a simpler storage class which does only the minimum required for zarr, rather than trying to be a whole file-system interface. However, it is also nice to have things in one place and with a consistent design/API.

I have been working on documentation and preparing fsspec for (alpha) release, but it has not been my main focus. The tests are only fairly rudimentary, and work is certainly needed. If I successfully make compatibility code for arrow-hdfs, s3fs and gcsfs ( fsspec/gcsfs#116 ) then it can probably already be considered as the backend for dask, swapping out code currently in dask.bytes.

@jakirkham (Member)

Honestly I'd be happy to see this integrated into Zarr. The code is very simple and I also know of some use cases where cloud storage is not involved and this is a perfect fit.

@jhamman (Member) commented Jan 3, 2019

Naive question: how hard would it be to set up a read-only web server for testing? I suspect that server could fetch data from a DictStore or similar...

@alimanfoo (Member)

> Naive question: how hard would it be to set up a read-only web server for testing? I suspect that server could fetch data from a DictStore or similar...

python -m http.server --directory /path/to/zarr/files/

?

@martindurant (Member)

This is the fixture fsspec uses (same as the command above, but with some retry/shutdown stuff)

https://github.com/martindurant/filesystem_spec/blob/master/fsspec/implementations/tests/test_http.py#L12

Note that the command was different in py2, if you wanted to support that.
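
A standard-library-only version of such a fixture might look like this (a sketch with the retry/shutdown handling reduced to the minimum; the fixture name is illustrative, and the directory argument requires Python 3.7+):

import functools
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

import pytest

@pytest.fixture
def http_server_url(tmp_path):
    # serve tmp_path, where the test would first write a DirectoryStore
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(tmp_path))
    server = HTTPServer(('localhost', 0), handler)  # port 0 picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    yield 'http://localhost:%d/' % server.server_address[1]
    server.shutdown()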

@ambrosejcarr

This back-end is very cool, and is likely what my team would need for our imaging use case. However, there is one thing that I want to make sure I understand: when I run the code in the PR and attempt to discover any nested groups, it notes that this store only works with "consolidated metadata".

The imaging use cases that I represent require nested groups, and I can't figure out if this feature is supported but in a different way, or not supported by this store type. If it's not supported, could you think of ways to add it?

Our existing solution (not using zarr) involves explicitly storing a JSON key: value map in each group that specifies the location of any sub-groups.

@martindurant (Member)

In limited cases of well-behaved servers giving directory-hierarchy links, the following may work for you: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.http.HTTPFileSystem
The file-system has a get_mapper() method that gives you the kind of object that zarr needs.

Zarr stores subgroups in exactly the way you suggest. "Consolidated metadata" means gathering up all of that information throughout a dataset into a single file, so that the number of connections to the remote store is minimised during the parsing phase. It is a useful optimisation. My implementation, if it works for your system, does not need this, but it's still a good idea.

@rabernat (Contributor, Author)

I believe that if you run zarr.consolidate_metadata on a store with nested groups, everything will end up in the single consolidated file. You could give this a try and report back.
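
Something like the following (a sketch; the paths, shapes, and chunk sizes are made up):

import zarr

store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store)
round1 = root.create_group('images/round1')  # nested groups
round1.zeros('raw', shape=(100, 100), chunks=(10, 10))

# gather all group/array metadata into a single '.zmetadata' key
zarr.consolidate_metadata(store)

# the nested hierarchy is then reachable through the consolidated view
grp = zarr.open_consolidated(store)
print(grp.tree())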

@alimanfoo (Member) commented Jan 11, 2019 via email

@martindurant (Member)

Right. My implementation of ls for HTTP is strictly a hack, looking at links, and cannot be relied on in the general case.

@jakirkham (Member)

Has anyone tried this with a NestedDirectoryStore?

@aparamon commented Feb 5, 2019

WebDAVStore is something very desirable! If this PR implements HTTPStore compatibly then 🎆

@alimanfoo (Member)

> WebDAVStore is something very desirable! If this PR implements HTTPStore compatibly then 🎆

Thanks @aparamon for the comment. I believe WebDAV is an extension of plain HTTP, and this store only uses the GET method, so it should be able to read data from a WebDAV server just as from a plain HTTP server.

I think the path for getting this PR complete would be just to add some tests. It's a read-only store, and a local HTTP server would need to be run, so tests would need some special setup, but it should be fairly straightforward.

Support for write operations via webdav is out of scope for this PR I think, but if anyone wanted to implement a full WebDAVStore then a PR would be welcome AFAIC.

@rabernat (Contributor, Author)

I am closing this PR and offering a sketch for the way forward.

It is not feasible to implement every possible type of remote storage protocol within the zarr-python package. Zarr does allow us to bring our own storage classes as mutable mappings, but this has limitations: mutable mappings don't necessarily have the other methods that more full-featured zarr storage classes have, such as reporting file sizes.

Since this discussion started, fsspec has matured a lot. I think we should consider making fsspec an optional zarr dependency. We should write a zarr storage class for a generic fsspec filesystem which can take advantage of more features of the fsspec API than just a mutable mapping. Then we should hook into fsspec's resolver mechanism. That would allow us to do things like zarr.open('http://foo/bar') or zarr.open('s3://foo/bar') and have it just work. This would also help with the testing issue: we can test a generic fsspec-based filesystem's integration with zarr, and then leave the testing of all the different implementations to fsspec.

I will try to work on implementing this in a separate PR.
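
Until that exists, the behaviour can be approximated in a few lines (a sketch; open_url is a hypothetical helper, not a zarr API):

import fsspec
import zarr

def open_url(url, **kwargs):
    # fsspec picks the filesystem from the protocol: http://, s3://, gs://, ...
    mapper = fsspec.get_mapper(url)
    return zarr.open_consolidated(mapper, **kwargs)

group = open_url('https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/')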

@rabernat closed this Feb 28, 2020
@ghost commented Feb 28, 2020

@rabernat Let me know if you need any help. I helped with some work over at Kedro (https://github.com/quantumblacklabs/kedro) to make our DataSets (read: storage classes) use fsspec. It made sense for us because it cut down on needing to have every file format duplicated n times, with minor changes, for n storage mediums (e.g. CSV would need CSVBlob, CSVS3, CSVGCP, etc.); instead we have one CSV storage class which loads and saves data based on the filepath provided, using fsspec.

@martindurant (Member)

Obviously, let me know when fsspec has holes in functionality or bugs that prevent its adoption by zarr or other readers.
@ZainPatelQB , your DataSets sound rather a lot like intake sources :)

@ghost commented Feb 28, 2020

Indeed @martindurant, we actually have an open issue about this here: https://github.com/quantumblacklabs/kedro/issues/26

@rabernat (Contributor, Author)

To be clear, if someone else (@ZainPatelQB or @martindurant) wants to take the lead on this, I would be thrilled. I unfortunately exist in a state of extreme overcommitment at this time and have no clear idea when I can actually find the time for this.

It is not a hard task. Probably 100 lines of code max. I'd be happy to review PRs.

@ZainPatelQB - thanks for sharing kedro. It looks amazing!

@martindurant (Member)

This is what Dask does, essentially in just one line: https://github.com/dask/dask/blob/master/dask/array/core.py#L2804 (the get_mapper there is actually this function in fsspec).
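
In user code that looks like the following (a sketch; the bucket path and storage_options are hypothetical, and reading s3:// URLs assumes s3fs is installed):

import dask.array as da

# from_zarr passes the URL through fsspec's get_mapper internally
arr = da.from_zarr('s3://some-bucket/dataset.zarr',
                   storage_options={'anon': True})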
