
HTTP Store #373 (Closed)

Conversation

@rabernat (Contributor)

In keeping with the holiday theme of new stores (e.g. #299), I have created a bare-bones implementation of an HTTP store. This is something I have been thinking about for a long time: the simplest possible way to access zarr data over the internet. The idea is simply to issue HTTP GET requests for all the desired data. This store only makes sense for read-only access to data with consolidated metadata, since plain HTTP does not support directory listing. However, for public data, it drastically simplifies the process of accessing remote data, bypassing the need for external libraries such as s3fs, gcsfs, etc. It also opens the door to decentralized peer-to-peer sharing of zarr data: just fire up a web server in front of your consolidated DirectoryStore.
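
Such a store is essentially a thin MutableMapping wrapper around GET requests. A minimal sketch of the shape of the idea (this is illustrative, not the exact code in this PR; the requests dependency and class name are assumptions):

import requests
from collections.abc import MutableMapping

class SimpleHTTPStore(MutableMapping):
    """Read-only store: each key lookup becomes an HTTP GET (sketch)."""

    def __init__(self, url):
        self.url = url.rstrip('/')

    def __getitem__(self, key):
        response = requests.get(self.url + '/' + key)
        if response.status_code == 404:
            raise KeyError(key)  # zarr treats KeyError as a missing key/chunk
        response.raise_for_status()
        return response.content

    def __setitem__(self, key, value):
        raise NotImplementedError('read-only store')

    def __delitem__(self, key):
        raise NotImplementedError('read-only store')

    def __iter__(self):
        raise NotImplementedError('plain HTTP cannot list keys')

    def __len__(self):
        raise NotImplementedError('plain HTTP cannot list keys')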

I feel this is a promising path towards incorporating some sort of built-in remote dataset access within zarr. We have two long-pending PRs (#293 and #252) which implement custom classes for Azure Blob Storage and GCS. Given the overlap with @martindurant's s3fs, gcsfs, etc., it's not obvious that these sorts of stores are worth the effort of maintaining within zarr. The HTTP store is a middle path: if you just want read-only access to public data, zarr can provide that. Otherwise, you need the third-party libraries.

I'm not sure the best way to test this, since it is fundamentally a read-only store and there are no existing examples of that to copy. Suggestions welcome. However, the following code works:

import zarr
# an existing consolidated public dataset
url = 'https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/'
store = zarr.storage.HTTPStore(url)
# a fully functional read-only Group full of arrays
group = zarr.open_consolidated(store)

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@martindurant (Member)

Please also consider the HTTP implementation in fsspec, which comes with a get_mapper() method for getting the mutable mapping object that zarr needs. https://github.com/martindurant/filesystem_spec/blob/master/fsspec/implementations/http.py
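
For example (a sketch assuming a recent fsspec; the URL is the one from the example above):

import fsspec
import zarr

# HTTPFileSystem handles the GET requests; get_mapper returns the
# MutableMapping interface that zarr's store API expects
mapper = fsspec.get_mapper('https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/')
group = zarr.open_consolidated(mapper)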

@rabernat (Contributor, Author)

@martindurant - last I checked (a few months ago), fsspec had no tests and was not considered ready for production. I see that things have moved along quite a bit since. Perhaps my effort here, like with #252, is obsolete.

@martindurant (Member)

As with google or azure, there may well be a benefit to having a simpler storage class which does only the minimum required for zarr, rather than trying to be a whole file-system interface. However, it is also nice to have things in one place and with a consistent design/API.

I have been working on documentation and preparing fsspec for (alpha) release, but it has not been my main focus. The tests are only fairly rudimentary, and work is certainly needed. If I successfully make compatibility code for arrow-hdfs, s3fs and gcsfs ( fsspec/gcsfs#116 ) then it can probably already be considered as the backend for dask, swapping out code currently in dask.bytes.

@jakirkham (Member)

Honestly I'd be happy to see this integrated into Zarr. The code is very simple and I also know of some use cases where cloud storage is not involved and this is a perfect fit.

@jhamman (Member) commented Jan 3, 2019

Naive question: how hard would it be to set up a read-only web server for testing? I suspect that server could fetch data from a DictStore or similar...

@alimanfoo (Member)

> Naive question: how hard would it be to set up a read-only web server for testing? I suspect that server could fetch data from a DictStore or similar...

python -m http.server --directory /path/to/zarr/files/

?

@martindurant (Member)

This is the fixture fsspec uses (same as the command above, but with some retry/shutdown stuff)

https://github.com/martindurant/filesystem_spec/blob/master/fsspec/implementations/tests/test_http.py#L12

Note that the command was different in py2, if you wanted to support that.
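
A standard-library-only version of such a fixture might look like this (a sketch with the retry/shutdown handling reduced to the minimum; the fixture name is illustrative, and the directory argument requires Python 3.7+):

import functools
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

import pytest

@pytest.fixture
def http_server_url(tmp_path):
    # serve tmp_path, where the test would first write a DirectoryStore
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(tmp_path))
    server = HTTPServer(('localhost', 0), handler)  # port 0 picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    yield 'http://localhost:%d/' % server.server_address[1]
    server.shutdown()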

@ambrosejcarr

This back-end is very cool, and is likely what my team would need for our imaging use case. However, there is one thing that I want to make sure I understand: when I run the code in the PR and attempt to discover any nested groups, it notes that this store only works with "consolidated metadata".

The imaging use cases that I represent require nested groups, and I can't figure out if this feature is supported but in a different way, or not supported by this store type. If it's not supported, could you think of ways to add it?

Our existing solution (not using zarr) involves explicitly storing a JSON key: value map in each group that specifies the location of any sub-groups.

@martindurant (Member)

In limited cases of well-behaved servers giving directory-hierarchy links, the following may work for you: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.implementations.http.HTTPFileSystem
The file-system has a get_mapper() method that gives you the kind of object that zarr needs.

Zarr stores subgroups in exactly the way you suggest. "Consolidated metadata" means gathering up all of that information throughout a dataset into a single file, so that the number of connections to the remote store is minimised during the parsing phase. It is a useful optimisation. My implementation, if it works for your system, does not need this, but it's still a good idea.

@rabernat (Contributor, Author)

I believe that if you run zarr.consolidate_metadata on a store with nested groups, everything will end up in the single consolidated file. You could give this a try and report back.
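
Something like the following (a sketch; the paths, shapes, and chunk sizes are made up):

import zarr

store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store)
round1 = root.create_group('images/round1')  # nested groups
round1.zeros('raw', shape=(100, 100), chunks=(10, 10))

# gather all group/array metadata into a single '.zmetadata' key
zarr.consolidate_metadata(store)

# the nested hierarchy is then reachable through the consolidated view
grp = zarr.open_consolidated(store)
print(grp.tree())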

@alimanfoo (Member) commented Jan 11, 2019 via email

@martindurant (Member)

Right. My implementation of ls for HTTP is strictly a hack, looking at links, and cannot be relied on in the general case.

@jakirkham (Member)

Has anyone tried this with a NestedDirectoryStore?

@aparamon commented Feb 5, 2019

WebDAVStore is something very desirable! If this PR implements HTTPStore compatibly then 🎆

@alimanfoo (Member)

> WebDAVStore is something very desirable! If this PR implements HTTPStore compatibly then 🎆

Thanks @aparamon for the comment. I believe WebDAV is an extension of plain HTTP, and this store only uses the GET method, so it should be able to read data from a WebDAV server just as from a plain HTTP server.

I think the path for getting this PR complete would be just to add some tests. It's a read-only store, and a local HTTP server would need to be run, so tests would need some special setup, but it should be fairly straightforward.

Support for write operations via webdav is out of scope for this PR I think, but if anyone wanted to implement a full WebDAVStore then a PR would be welcome AFAIC.

@rabernat (Contributor, Author)

I am closing this PR and offering a sketch for the way forward.

It is not feasible to implement every possible type of remote storage protocol within the zarr-python package. Zarr does allow us to bring our own storage classes as mutable mappings, but this has limitations: mutable mappings don't necessarily have the other methods that more full-featured zarr storage classes have, such as reporting file sizes.

Since this discussion started, fsspec has matured a lot. I think we should consider making fsspec an optional zarr dependency. We should write a zarr storage class for a generic fsspec filesystem which can take advantage of more features of the fsspec API than just a mutable mapping. Then we should hook into fsspec's resolver mechanism. That would allow us to do things like zarr.open('http://foo/bar') or zarr.open('s3://foo/bar') and have it just work. This would also help with the testing issue: we can test a generic fsspec-based filesystem's integration with zarr, and then leave the testing of all the different implementations to fsspec.

I will try to work on implementing this in a separate PR.
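
Until that exists, the behaviour can be approximated in a few lines (a sketch; open_url is a hypothetical helper, not a zarr API):

import fsspec
import zarr

def open_url(url, **kwargs):
    # fsspec picks the filesystem from the protocol: http://, s3://, gs://, ...
    mapper = fsspec.get_mapper(url)
    return zarr.open_consolidated(mapper, **kwargs)

group = open_url('https://storage.googleapis.com/pangeo-data/ecco/eccov4r3/')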

@rabernat closed this Feb 28, 2020
@ghost commented Feb 28, 2020

@rabernat Let me know if you need any help. I helped with some work over at Kedro (https://github.com/quantumblacklabs/kedro) to make our DataSets (read: storage classes) use fsspec. It made sense for us because it cut down on needing to have every file format duplicated n times, with minor changes, for n storage mediums (e.g. CSV would need CSVBlob, CSVS3, CSVGCP, etc.); instead we have one CSV storage class which loads and saves data based on the filepath provided, using fsspec.

@martindurant (Member)

Obviously, let me know when fsspec has holes in functionality or bugs that prevent its adoption by zarr or other readers.
@ZainPatelQB , your DataSets sound rather a lot like intake sources :)

@ghost commented Feb 28, 2020

Indeed @martindurant, we actually have an open issue about this here: https://github.com/quantumblacklabs/kedro/issues/26

@rabernat (Contributor, Author)

To be clear, if someone else (@ZainPatelQB or @martindurant) wants to take the lead on this, I would be thrilled. I unfortunately exist in a state of extreme overcommitment at this time and have no clear idea when I can actually find the time for this.

It is not a hard task. Probably 100 lines of code max. I'd be happy to review PRs.

@ZainPatelQB - thanks for sharing kedro. It looks amazing!

@martindurant (Member)

This is what Dask does, essentially in just one line: https://github.com/dask/dask/blob/master/dask/array/core.py#L2804 (the get_mapper there is actually this function in fsspec).
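
In user code that looks like the following (a sketch; the bucket path and storage_options are hypothetical, and reading s3:// URLs assumes s3fs is installed):

import dask.array as da

# from_zarr passes the URL through fsspec's get_mapper internally
arr = da.from_zarr('s3://some-bucket/dataset.zarr',
                   storage_options={'anon': True})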
