
WIP: google cloud storage class #252

Closed
rabernat wants to merge 3 commits

Conversation

rabernat
Contributor

First, apologies for submitting an unsolicited pull request; I know that is against the contributor guidelines. I thought this idea would be easier to discuss with a concrete implementation to look at.

In my highly opinionated view, the killer feature of zarr is its ability to efficiently store array data in cloud storage. Currently, the recommended way to do this is via outside packages (e.g. s3fs, gcsfs), which provide a MutableMapping that zarr can store things in.
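For reference, the gcsfs route looks roughly like this (a minimal sketch; the bucket and project names are placeholders):

import gcsfs
import zarr
import numpy as np

# gcsfs exposes a bucket path as a MutableMapping that zarr can write into
fs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.mapping.GCSMap('my-bucket/my-array.zarr', gcs=fs)
z = zarr.open(store, mode='w', shape=(1000, 1000), chunks=(100, 100), dtype='f8')
z[:] = np.random.random((1000, 1000))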

In this PR, I have implemented an experimental google cloud storage class directly within zarr.

Why did I do this? Because, in the pangeo project, we are now making heavy use of the xarray -> zarr -> gcsfs -> cloud storage stack. I have come to the conclusion that a tighter coupling between zarr and gcs, via the google.cloud.storage API, may prove advantageous.

In addition to performance benefits and easier debugging, I think there are social advantages to having cloud storage as a first-class part of zarr. Lots of people want to store arrays in the cloud, and if zarr can provide this capability more natively, it could increase adoption.

Thoughts?

These tests require GCP credentials and the google-cloud-storage package. It is possible to add encrypted credentials to Travis, but I haven't done that yet. Tests are (mostly) working locally for me.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or
    pytest -v --doctest-modules zarr)
  • Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or
    pytest -v zarr)
  • PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@jakirkham
Member

Have you tried using this in your current setup already? Does it yield the speedup that you would expect?

@kalvdans

What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions?

@alimanfoo
Member

Thanks Ryan, personally I'm very happy to entertain this. We're actually giving some serious thought to setting up a pangeo-like environment in the cloud to support our genomics work, so having an efficient interface to cloud object storage will become important for us I expect. There is also precedent for moving storage classes into zarr to allow for optimisation and increased convenience. E.g., both the DirectoryStore and ZipStore classes replicate a lot of what was already provided in the zict package, but I ended up re-implementing because there were optimisations I wanted to add and it was easier to do that within zarr.

I just took a browse through the gcsfs code and what you have implemented is a lot leaner, which is very nice. Out of interest, do you see a performance gain over using gcsfs, and do you know what is giving the gain?

$64,000 question, if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code?

@rabernat changed the title from "google cloud storage class" to "WIP: google cloud storage class" on Mar 30, 2018
@rabernat
Contributor Author

Have you tried using this in your current setup already? Does it yield the speedup that you would expect?

So far, in my simple benchmarks, the performance is exactly the same as gcsfs. However, my hope is that it will be easier to debug when problems occur (see e.g. pangeo-data/pangeo#166, fsspec/gcsfs#91, fsspec/gcsfs#90 fsspec/gcsfs#89, fsspec/gcsfs#61). Also, I expect that having tighter coupling between zarr and gcs will allow us to optimize performance more going forward.

What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions?

I don't know about this. Finding a way to mock the google.cloud.storage would be useful. However, running the test suite on the real google service also has major advantages.
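If someone did want a mock-based test, a rough sketch (assuming __getitem__ ends up calling bucket.get_blob() followed by blob.download_as_string(), which is not finalised in this PR) might look like:

from unittest import mock
import zarr

def test_gcsstore_getitem_with_mocked_client():
    # fake out google-cloud-storage so no credentials or network are needed
    fake_blob = mock.Mock()
    fake_blob.download_as_string.return_value = b'chunk-bytes'
    fake_bucket = mock.Mock()
    fake_bucket.get_blob.return_value = fake_blob
    # google-cloud-storage still needs to be importable for the patch target
    with mock.patch('google.cloud.storage.Client') as FakeClient:
        FakeClient.return_value.get_bucket.return_value = fake_bucket
        store = zarr.storage.GCSStore('my-bucket')
        assert store['0.0'] == b'chunk-bytes'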

$64,000 question, if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code?

Sure.

I would like to do a lot more testing before moving forward. In particular, I want to understand how this behaves in the context of distributed. Serializing this store is a bit tricky. This is pretty mature in gcsfs.
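For what it's worth, one common pattern (just a sketch, not what this branch currently does) is to drop the live bucket when pickling and rebuild it on the worker; this still assumes the worker can obtain application-default credentials:

class GCSStore:  # mapping methods omitted; only the pickle-related parts shown
    def __init__(self, bucket_name, prefix=None, client_kwargs=None):
        self.bucket_name = bucket_name
        self.prefix = prefix
        self.client_kwargs = client_kwargs or {}
        self.initialize_bucket()

    def initialize_bucket(self):
        from google.cloud import storage
        client = storage.Client(**self.client_kwargs)
        self.bucket = client.get_bucket(self.bucket_name)

    def __getstate__(self):
        # the live client/bucket hold sockets and credentials that don't pickle
        state = self.__dict__.copy()
        state.pop('bucket', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.initialize_bucket()  # re-create the client on the receiving worker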

@alimanfoo
Member

alimanfoo commented Mar 30, 2018 via email

@jakirkham
Member

...unless there is a free CI service that runs inside Google cloud

It seems like this would be a boon for Google, as it would make it easier for developers to build up an ecosystem around their service, bringing in more customers reliant on that ecosystem. Makes me wonder if it really does exist and will just take a little searching to find. Even if they don't advertise it, we can probably ask, and they may give us a free instance for testing.

@alimanfoo
Member

alimanfoo commented Apr 2, 2018 via email

@martindurant
Member

I'll be happy to help here if there's anything I can do.

@alimanfoo
Member

alimanfoo commented Apr 9, 2018 via email

@tjcrone
Member

tjcrone commented Apr 13, 2018

I tried installing with this commit using:

pip install git+https://github.com/rabernat/zarr@17c9f11073004689b0a505b2df0fd0c0437cad41

which worked, and installed as:

zarr                      2.1.5.dev478+dirty           <pip>

However when I tried running ds.to_zarr(), I got the following error:

Zarr version 2.2 or greater is required by xarray. See zarr installation http://zarr.readthedocs.io/en/stable/#installation

Any ideas on how to fix this? Thanks!

cc @friedrichknuth

@jakirkham
Member

Probably need to merge with latest master locally.

@tjcrone
Member

tjcrone commented Apr 13, 2018

Okay thanks. I will go ahead and fork from @rabernat and merge master to test this.

@tjcrone
Member

tjcrone commented Apr 13, 2018

Actually the right way to do this was to fork upstream and then add @rabernat's repo as a remote and merge his gcs_store branch. Git wranglin!

@tjcrone
Member

tjcrone commented Apr 13, 2018

Now when I try ds.to_zarr('pangeo-asdf2'), I get:

distributed.scheduler - ERROR - error from worker tcp://10.244.59.6:45973: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.67.6:36687: array not found at path 'video'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-823bd5f45788> in <module>()
----> 1 ds.to_zarr('pangeo-asdf2')

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding)
   1163         from ..backends.api import to_zarr
   1164         return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
-> 1165                        group=group, encoding=encoding)
   1166 
   1167     def __unicode__(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding)
    777     # I think zarr stores should always be sync'd immediately
    778     # TODO: figure out how to properly handle unlimited_dims
--> 779     dataset.dump_to_store(store, sync=True, encoding=encoding)
    780     return store

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims)
   1068                     unlimited_dims=unlimited_dims)
   1069         if sync:
-> 1070             store.sync()
   1071 
   1072     def to_netcdf(self, path=None, mode='w', format=None, group=None,

/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in sync(self)
    365 
    366     def sync(self):
--> 367         self.writer.sync()
    368 
    369 

/opt/conda/lib/python3.6/site-packages/xarray/backends/common.py in sync(self)
    268         if self.sources:
    269             import dask.array as da
--> 270             da.store(self.sources, self.targets, lock=self.lock)
    271             self.sources = []
    272             self.targets = []

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in store(sources, targets, lock, regions, compute, return_stored, **kwargs)
    953 
    954         if compute:
--> 955             result.compute(**kwargs)
    956             return None
    957         else:

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    153         dask.base.compute
    154         """
--> 155         (result,) = compute(self, traverse=False, **kwargs)
    156         return result
    157 

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    402     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    403                     else (None, a) for a in args]
--> 404     results = get(dsk, keys, **kwargs)
    405     results_iter = iter(results)
    406     return tuple(a if f is None else f(next(results_iter), *a)

/opt/conda/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
   2089             try:
   2090                 results = self.gather(packed, asynchronous=asynchronous,
-> 2091                                       direct=direct)
   2092             finally:
   2093                 for f in futures.values():

/opt/conda/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1501             return self.sync(self._gather, futures, errors=errors,
   1502                              direct=direct, local_worker=local_worker,
-> 1503                              asynchronous=asynchronous)
   1504 
   1505     @gen.coroutine

/opt/conda/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    613             return future
    614         else:
--> 615             return sync(self.loop, func, *args, **kwargs)
    616 
    617     def __repr__(self):

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    251             e.wait(10)
    252     if error[0]:
--> 253         six.reraise(*error[0])
    254     else:
    255         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    236             yield gen.moment
    237             thread_state.asynchronous = True
--> 238             result[0] = yield make_coro()
    239         except Exception as exc:
    240             error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1379                             six.reraise(type(exception),
   1380                                         exception,
-> 1381                                         traceback)
   1382                     if errors == 'skip':
   1383                         bad_keys.add(key)

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

/opt/conda/lib/python3.6/site-packages/distributed/protocol/pickle.py in loads()
     57 def loads(x):
     58     try:
---> 59         return pickle.loads(x)
     60     except Exception:
     61         logger.info("Failed to deserialize %s", x[:10000], exc_info=True)

/opt/conda/lib/python3.6/site-packages/zarr/core.py in __setstate__()
   1928 
   1929     def __setstate__(self, state):
-> 1930         self.__init__(*state)
   1931 
   1932     def _synchronized_op(self, f, *args, **kwargs):

/opt/conda/lib/python3.6/site-packages/zarr/core.py in __init__()
    121 
    122         # initialize metadata
--> 123         self._load_metadata()
    124 
    125         # initialize attributes

/opt/conda/lib/python3.6/site-packages/zarr/core.py in _load_metadata()
    138         """(Re)load metadata from store."""
    139         if self._synchronizer is None:
--> 140             self._load_metadata_nosync()
    141         else:
    142             mkey = self._key_prefix + array_meta_key

/opt/conda/lib/python3.6/site-packages/zarr/core.py in _load_metadata_nosync()
    149             meta_bytes = self._store[mkey]
    150         except KeyError:
--> 151             err_array_not_found(self._path)
    152         else:
    153 

/opt/conda/lib/python3.6/site-packages/zarr/errors.py in err_array_not_found()
     23 
     24 def err_array_not_found(path):
---> 25     raise ValueError('array not found at path %r' % path)
     26 
     27 

ValueError: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.68.6:45095: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.64.6:36365: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.57.6:46707: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.69.6:43569: array not found at path 'video'

@rabernat
Contributor Author

@tjcrone: your error is unrelated to this PR. When you call ds.to_zarr('string'), you write to a zarr.DirectoryStore by default. You are getting an error because the distributed workers can't see the directory.

In order to use this store, you need to create a GCSStore directly, e.g.

gcsstore = zarr.storage.GCSStore('zarr-test', 'zarr-gcs-store', client_kwargs={'project': 'pangeo-181919'})
ds.to_zarr(gcsstore)

Note that the branch is highly experimental, work in progress, and consequently highly error prone.

@tjcrone
Member

tjcrone commented Apr 13, 2018

Thanks @rabernat. I'll make sure to test this all locally inside the notebook server for now and not involve workers. When trying your suggestion,

gcsstore = zarr.storage.GCSStore('zarr-test', 'zarr-gcs-store', client_kwargs={'project': 'pangeo-181919'})

(after swapping in my own project id), I got the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-22-27e581132598> in <module>()
----> 1 gcsstore = zarr.storage.GCSStore('pangeo-asdf', client_kwargs={'project': 'pangeo-198314'})

/opt/conda/lib/python3.6/site-packages/zarr/storage.py in __init__(self, bucket_name, prefix, client_kwargs)
   1941         self.prefix = normalize_storage_path(prefix)
   1942         self.client_kwargs = {}
-> 1943         self.initialize_bucket()
   1944 
   1945     def initialize_bucket(self):

/opt/conda/lib/python3.6/site-packages/zarr/storage.py in initialize_bucket(self)
   1946         from google.cloud import storage
   1947         # run `gcloud auth application-default login` from shell
-> 1948         client = storage.Client(**self.client_kwargs)
   1949         self.bucket = client.get_bucket(self.bucket_name)
   1950         # need to properly handle excpetions

/opt/conda/lib/python3.6/site-packages/google/cloud/storage/client.py in __init__(self, project, credentials, _http)
     57         self._base_connection = None
     58         super(Client, self).__init__(project=project, credentials=credentials,
---> 59                                      _http=_http)
     60         self._connection = Connection(self)
     61         self._batch_stack = _LocalStack()

/opt/conda/lib/python3.6/site-packages/google/cloud/client.py in __init__(self, project, credentials, _http)
    213 
    214     def __init__(self, project=None, credentials=None, _http=None):
--> 215         _ClientProjectMixin.__init__(self, project=project)
    216         Client.__init__(self, credentials=credentials, _http=_http)

/opt/conda/lib/python3.6/site-packages/google/cloud/client.py in __init__(self, project)
    169         project = self._determine_default(project)
    170         if project is None:
--> 171             raise EnvironmentError('Project was not passed and could not be '
    172                                    'determined from the environment.')
    173         if isinstance(project, six.binary_type):

OSError: Project was not passed and could not be determined from the environment.

I feel like the project was passed.

zarr/storage.py Outdated

self.bucket_name = bucket_name
self.prefix = normalize_storage_path(prefix)
self.client_kwargs = {}
Contributor Author


@tjcrone I guess because client_kwargs is not actually initialized properly!

@tjcrone
Member

tjcrone commented Apr 13, 2018

Thanks @rabernat. This fix worked! Nice. The only comment I have at this stage is that when a group within the bucket already exists we get:

ValueError: path '' contains a group

which would be more helpful if it were more explanatory and if it included the actual path in the error.

@alimanfoo
Member

alimanfoo commented Apr 13, 2018 via email

@tjcrone
Member

tjcrone commented Apr 14, 2018

I think it is worth noting that for this particular extension, when the path is not empty, it shows up as empty in the error:

ValueError: path '' contains a group

It is also worth noting that an indication that the path "already" contains a group would go a long way toward making this error more helpful. Saying that the group "already exists" would also work.
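Something along these lines in zarr/errors.py might cover both points (a hypothetical rewording, not part of this PR):

def err_contains_group(path):
    path = path or '/'  # the store root otherwise renders as ''
    raise ValueError('group already exists at path %r' % path)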

@alimanfoo
Member

alimanfoo commented Apr 14, 2018 via email

@tjcrone
Member

tjcrone commented Apr 14, 2018

Sorry for not including an example.

This code works fine when it runs the first time, as long as the rte-pangeo-data bucket exists (it will not create a bucket):

import zarr
gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314'})
ds.to_zarr(gcsstore)

But when run a second time it gives:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-f5593b801545> in <module>()
----> 1 ds.to_zarr(gcsstore)

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding)
   1163         from ..backends.api import to_zarr
   1164         return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
-> 1165                        group=group, encoding=encoding)
   1166 
   1167     def __unicode__(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding)
    773     store = backends.ZarrStore.open_group(store=store, mode=mode,
    774                                           synchronizer=synchronizer,
--> 775                                           group=group, writer=None)
    776 
    777     # I think zarr stores should always be sync'd immediately

/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, writer)
    258                                       "#installation" % min_zarr)
    259         zarr_group = zarr.open_group(store=store, mode=mode,
--> 260                                      synchronizer=synchronizer, path=group)
    261         return cls(zarr_group, writer=writer)
    262 

/opt/conda/lib/python3.6/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path)
   1126             err_contains_array(path)
   1127         elif contains_group(store, path=path):
-> 1128             err_contains_group(path)
   1129         else:
   1130             init_group(store, path=path)

/opt/conda/lib/python3.6/site-packages/zarr/errors.py in err_contains_group(path)
     15 
     16 def err_contains_group(path):
---> 17     raise ValueError('path %r contains a group' % path)
     18 
     19 

ValueError: path '' contains a group

@tjcrone
Member

tjcrone commented Apr 14, 2018

In case anyone wants to test this with multiple Dask workers, it is possible to create an OAuth credentials object using the following:

import google.auth
credentials, project = google.auth.default()

and then pass this credentials object when creating the GCSStore:

gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314', 'credentials': credentials})

For this to work as written it would be necessary to authenticate using the Google Cloud SDK:

gcloud auth application-default login --no-launch-browser

So far in all of my testing this code is working great!

@rabernat
Contributor Author

rabernat commented Apr 14, 2018 via email

@tjcrone
Member

tjcrone commented Apr 14, 2018

Interesting question. How would I test this? Note that this credentials object is secret and allows a lot of access to my own GCP resources, so I'm not sure I would want to pickle and distribute it. There are ways of creating credentials with reduced permissions: https://google-auth.readthedocs.io/en/latest/user-guide.html.
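One way to check the serialization side locally, without involving workers or sharing credentials, is a simple pickle round trip (a sketch; it assumes the store supports normal mapping iteration):

import pickle
import zarr

gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1',
                                 client_kwargs={'project': 'pangeo-198314'})
restored = pickle.loads(pickle.dumps(gcsstore))
assert list(restored) == list(gcsstore)  # same keys visible after the round trip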

@jakirkham
Member

FWIW opened issue ( https://github.com/zarr-developers/zarr/issues/290 ) to discuss cloud support generally. Please feel free to share anything relevant there.

Member

@alimanfoo left a comment


A couple of comments from my experience trying this out and looking at performance for retrieving small objects.

`default credentials <https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login>`_.
"""

def __init__(self, bucket_name, prefix=None, client_kwargs={}):
Member


Might be worth adding an option to use an anonymous client. E.g., add an anonymous=False keyword argument, then make use of storage.Client.create_anonymous_client() when creating the client if the user has provided anonymous=True.

from google.cloud import storage
# run `gcloud auth application-default login` from shell
client = storage.Client(**self.client_kwargs)
self.bucket = client.get_bucket(self.bucket_name)
Member


Note that it's also possible to do:

self.bucket = storage.Bucket(client, name=self.bucket_name)

...which involves no network communication. Not sure this is a good idea in general, as we may want to retrieve the bucket info, but just mentioning it.
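Combining this with the anonymous-client suggestion above, initialize_bucket could look roughly like the following (a sketch only; the anonymous keyword is an invented name, not part of the PR):

def initialize_bucket(self, anonymous=False):
    from google.cloud import storage
    if anonymous:
        # public buckets: no credentials, and skip the get_bucket() metadata
        # call, which anonymous callers may not be permitted to make
        client = storage.Client.create_anonymous_client()
        self.bucket = storage.Bucket(client, name=self.bucket_name)
    else:
        client = storage.Client(**self.client_kwargs)
        self.bucket = client.get_bucket(self.bucket_name)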


def __getitem__(self, key):
blob_name = self.full_path(key)
blob = self.bucket.get_blob(blob_name)
Member


An alternative here is to do:

        from google.cloud import storage
        blob = storage.Blob(blob_name, self.bucket)

...which involves less network communication (profiling shows the number of calls to the 'read' method of '_ssl._SSLSocket' objects goes from 3 down to 1) and reduces the time to retrieve small objects by around 50%.

If this change was made, some rethinking of error handling may be needed, as the point at which a non-existing blob was detected might change.
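Putting the suggestion and the error-handling caveat together, the store's __getitem__ might end up looking like this (a sketch; the missing-key case now surfaces at download time rather than at lookup time):

def __getitem__(self, key):
    from google.cloud import storage
    from google.cloud.exceptions import NotFound
    blob_name = self.full_path(key)
    blob = storage.Blob(blob_name, self.bucket)  # constructed locally, no request yet
    try:
        return blob.download_as_string()
    except NotFound:
        raise KeyError(key)  # a missing blob is only detected when downloading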

@alimanfoo
Member

@rabernat just to say that, FWIW, I think this is worth pursuing. I know @martindurant has just added in some improvements to gcsfs to reduce latency and use checksums to verify content, which is great, and so both the performance and error reporting issues that have come up with gcsfs may be resolved. But while we are still gaining experience with GCS, also having a mapping implementation based on the google cloud client library I think would be valuable, so we can compare performance and see if issues replicate across both this and gcsfs or not. Obviously contingent on you (or someone else) having time and inclination, but I'm currently also starting to use google cloud storage and so would be happy to chip in.

@rabernat
Contributor Author

rabernat commented Sep 7, 2018

Great! I agree it is a good way forward.

Moving forward, I think a good question (related to #290) is whether we want to have a generic base class for object stores and then extend that for GCS, S3, ABS, etc. Maybe this is overcomplicating things though...
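For concreteness, one possible shape for such a base class (purely illustrative; the names are invented here):

from collections.abc import MutableMapping

class ObjectStore(MutableMapping):
    """Shared key/value plumbing; subclasses talk to GCS, S3, ABS, etc."""

    def _get(self, key):  # subclasses raise KeyError if the object is missing
        raise NotImplementedError

    def _put(self, key, value):
        raise NotImplementedError

    def _delete(self, key):
        raise NotImplementedError

    def _list(self):
        raise NotImplementedError

    def __getitem__(self, key):
        return self._get(key)

    def __setitem__(self, key, value):
        self._put(key, value)

    def __delitem__(self, key):
        self._delete(key)

    def __iter__(self):
        return iter(self._list())

    def __len__(self):
        return len(self._list())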

The reality is that I am teaching two classes this semester and am unlikely to have the time to dig into this deeply any time soon. You, @martindurant, @tjcrone, and @jakirkham are all clearly qualified to pick up where I left off here.

@martindurant
Member

Having a mapping class over generic file-system implementations was one of the points of fsspec, which will, of course, look rather familiar.
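For example, with fsspec the backend choice collapses to a URL (a sketch; it assumes gcsfs is installed and the bucket exists):

import fsspec
import zarr

# get_mapper picks the gcsfs backend from the URL scheme; extra kwargs pass through
store = fsspec.get_mapper('gcs://my-bucket/my-dataset.zarr', project='my-project')
root = zarr.open_group(store, mode='a')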

@alimanfoo
Member

alimanfoo commented Sep 8, 2018 via email

@alimanfoo
Member

Btw it looks like there is no local emulator for GCS, and this is an open issue for the google cloud Python client library: https://github.com/googleapis/google-cloud-python/issues/4840 (see also googleapis/google-cloud-python#4897).

@martindurant
Member

@alimanfoo - well aware of this, and I had to jump through a number of uncomfortable hoops to test gcsfs using vcrpy (which can record and mock any urllib calls, but not easily). moto and azurite really help for S3 and azure-datalake/blob in this respect.

@alimanfoo
Member

@martindurant thanks, yes one of your comments was what prompted me to do a bit of digging. I guess it might be worth trying to put a bit of pressure on Google folks, it doesn't look like they've prioritised this highly.

@rabernat mentioned this pull request on Dec 28, 2018
@jakirkham
Member

Is this still interesting, @rabernat? FWIW I'd be +1 on getting this integrated. Probably still some things to address before merging though.

@alimanfoo
Member

alimanfoo commented Mar 2, 2019 via email

@martindurant
Member

there is support for local emulation

Where did you see that? It would make testing gcsfs much easier!

@alimanfoo
Member

alimanfoo commented Mar 25, 2019 via email

@rabernat
Contributor Author

rabernat commented Mar 25, 2019

Before adding any new storage classes, I think we should address #414 and #301. The storage module is already super long and, at this point, pretty random. We have direct support for Azure and Redis but not S3?

I basically stopped working on this because I wanted to see how filesystemspec evolved. Maybe now is the time to assess whether we want to continue implementing more storage classes or add filesystemspec as a dependency. Someone also recently posted a link to another key/value abstraction we could look at, but I can't find the link.

This is also tied to the spec discussion. As we start dumping zarr in to more and more stores, how do we ensure that these will be readable by all zarr implementations?

My view is that we need more explicit specs for file and cloud storage that will allow other libraries to implement them in a compatible way.

@martindurant
Member

Should we discuss this explicitly at the next zarr meeting? I think a little effort should be spent integrating gcsfs, s3fs and adlfs (or blob) completely with fsspec to complete the picture.

@jakirkham
Member

What follows are merely my opinions, please feel free to agree or disagree with them as you see fit.

Refactoring out the existing storage layers into an independent library is probably already a worthwhile endeavor. There seem to be other people out there looking for or trying similar things. So it would be useful to engage them at that level. It will probably also make it easier to handle optional dependencies as was needed recently for Azure.

So S3 already works without any effort on our end. Namely s3fs provides S3Map, which we are able to use as noted in the tutorial. Though am now noticing there is GCSMap, which is maybe already sufficient? Though maybe you have tried this already. If so, what issues did you encounter?

Personally I'm not convinced that Zarr needs to use filesystem_spec. That isn't to say we shouldn't allow people the option or make sure things work in case people would like to use it. Am just not thinking it needs to be a required dependency of Zarr.

Are you thinking of simplekv? Posted that in the refactoring issue ( #414 ). It definitely has some similarities to us; it also has a few extra features we probably don't need, but it is a closer match to what we have. Probably worthwhile to work with them on a solution if they're interested.

It's reasonable to be concerned about spec impact. Though I think the key-value stores are the least of our concerns, as they are already well-articulated and unlikely to change (unless we start adding links and references). The concerning part from the spec point of view is how we handle translations of data to key-value pairs in these stores. For example, object types are already an obstacle to compatibility that we will need to figure out. There are a few other examples as well.

@martindurant
Member

martindurant commented Mar 26, 2019

So S3 already works without any effort on our end. Namely s3fs provides S3Map, which we are able to use as noted in the tutorial. Though am now noticing there is GCSMap, which is maybe already sufficient? Though maybe you have tried this already. If so, what issues did you encounter?

S3Map and GCSMap are currently used for accessing zarr (which is why I was not convinced of the need for a google mapping implementation, or in fact azure blob). Indeed, some access patterns were changed in gcsfs specifically because of zarr. This wholly substandard function exists for passing a URL to Dask and picking the right mapper for zarr loading/saving. Taking that out of the hands of Dask, without replicating it in zarr or elsewhere, is exactly the sort of thing fsspec is for. It would always be optional, though, just as the current file-system backends are optional: the user would always be allowed to use their own mapping-compatible store.

@joshmoore
Member

@rabernat, is there anything that needs resurrecting from this? or safe to close?

@rabernat
Contributor Author

Definitely safe to close. Gcsfs meets all our needs here.

@rabernat closed this on Nov 23, 2021