WIP: google cloud storage class #252

rabernat · 2018-03-29T20:33:19Z

First, apologies for submitting an un-solicited pull request. I know that is against the contributor guidelines. I thought this idea would be easier to discuss with a concrete implementation to look at.

In my highly opinionated view, the killer feature of zarr is its ability to efficiently store array data in cloud storage. Currently, the recommended way to do this is via outside packages (e.g. s3fs, gcsfs), which provide a MutableMapping that zarr can store things in.

In this PR, I have implemented an experimental google cloud storage class directly within zarr.

Why did I do this? Because, in the pangeo project, we are now making heavy use of the xarray -> zarr -> gcsfs -> cloud storage stack. I have come to the conclusion that a tighter coupling between zarr and gcs, via the google.cloud.storage API, may prove advantageous.

In addition to performance benefits and easier debugging, I think there are social advantages to having cloud storage as a first-class part of zarr. Lots of people want to store arrays in the cloud, and if zarr can provide this capability more natively, it could increase adoption.

Thoughts?

These tests require GCP credentials and the google cloud storage package. It is possible add encrypted credentials to travis, but I haven't done that yet. Tests are (mostly) working locally for me.

TODO:

jakirkham · 2018-03-29T21:42:29Z

Have you tried using this in your current setup already? Does it yield the speedup that you would expect?

kalvdans · 2018-03-29T21:45:46Z

What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions?

alimanfoo · 2018-03-29T22:07:00Z

Thanks Ryan, personally I'm very happy to entertain this. We're actually giving some serious thought to setting up a pangeo-like environment in the cloud to support our genomics work, so having an efficient interface to cloud object storage will become important for us I expect. There is also precedent for moving storage classes into zarr to allow for optimisation and increased convenience. E.g., both the DirectoryStore and ZipStore classes replicate a lot of what was already provided in the zict package, but I ended up re-implementing because there were optimisations I wanted to add and it was easier to do that within zarr.

I just took a browse through the gcsfs code and what you have implemented is a lot leaner, which is very nice. Out of interest, do you see a performance gain over using gcsfs, and do you know what is giving the gain?

$64,000 question, if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code?

rabernat · 2018-03-30T17:43:56Z

Have you tried using this in your current setup already? Does it yield the speedup that you would expect?

So far, in my simple benchmarks, the performance is exactly the same as gcsfs. However, my hope is that it will be easier to debug when problems occur (see e.g. pangeo-data/pangeo#166, fsspec/gcsfs#91, fsspec/gcsfs#90 fsspec/gcsfs#89, fsspec/gcsfs#61). Also, I expect that having tighter coupling between zarr and gcs will allow us to optimize performance more going forward.

What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions?

I don't know about this. Finding a way to mock the google.cloud.storage would be useful. However, running the test suite on the real google service also has major advantages.

$64,000 question, if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code?

Sure.

I would like to do a lot more testing before moving forward. In particular, I want to understand how this behaves in the context of distributed. Serializing this store is a bit tricky. This is pretty mature in gcsfs.

alimanfoo · 2018-03-30T22:38:55Z

On Fri, 30 Mar 2018, 18:43 Ryan Abernathey, ***@***.***> wrote: Have you tried using this in your current setup already? Does it yield the speedup that you would expect? So far, in my simple benchmarks, the performance is exactly the same as gcsfs. However, my hope is that it will be easier to debug when problems occur (see e.g. pangeo-data/pangeo#166 <pangeo-data/pangeo#166>, fsspec/gcsfs#91 <fsspec/gcsfs#91>, fsspec/gcsfs#90 <fsspec/gcsfs#90> fsspec/gcsfs#89 <fsspec/gcsfs#89>, fsspec/gcsfs#61 <fsspec/gcsfs#61>). Also, I expect that having tighter coupling between zarr and gcs will allow us to optimize performance more going forward.

Scanning those issues it looks like there's a lot still to be understood around authn/authz and performance. If having GCS support in Zarr helps unpick those issues then I'm all for it, even if it's only a temporary thing and the resolutions ultimately get implemented back in gcsfs.

What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions? I don't know about this. Finding a way to mock the google.cloud.storage would be useful. However, running the test suite on the *real* google service also has major advantages.

I guess CI services would need to run tests against a mocked google.cloud.storage, I don't know of any way of testing against the real service from within CI that wouldn't be costing someone money, unless there is a free CI service that runs inside Google cloud. But it would be essential to verify somehow that the full test suite runs against the real service.

$64,000 question, if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code? Sure. I would like to do a lot more testing before moving forward. In particular, I want to understand how this behaves in the context of distributed. Serializing this store is a bit tricky. This is pretty mature in gcsfs.

Sounds good.

jakirkham · 2018-03-30T23:29:52Z

...unless there is a free CI service that runs inside Google cloud

It seems like this would a boon for Google as they would be making it easier for developers to build up an ecosystem around their service. Thus bringing in more customers reliant on that ecosystem. Makes me wonder if it really does exist and it will just take a little searching to find it. Even if they don't advertise it, we can probably ask and they may give us some free instance for testing.

alimanfoo · 2018-04-02T12:51:06Z

Reading https://cloudplatform.googleblog.com/2018/03/optimizing-your-Cloud-Storage-performance-Google-Cloud-Performance-Atlas.html?m=1it says that uploads are load balanced via the path of the object. If objects have similar paths (as would chunks within a Zarr array) then load balancing is not good. In this case faster parallel uploads can be achieved by modifying paths, e.g., prepending a hash. I've heard this is the same for S3 also. Might be worth trying this, i.e., modifying getitem and setitem so they transform keys by prepending hash before interacting with GCS. There would be some consequences to prepending a hash in terms of other operations less convenient or efficient, but maybe worth exploring.

…

On Sat, 31 Mar 2018, 00:29 jakirkham, ***@***.***> wrote: ...unless there is a free CI service that runs inside Google cloud It seems like this would a boon for Google as they would be making it easier for developers to build up an ecosystem around their service. Thus bringing in more customers reliant on that ecosystem. Makes me wonder if it really does exist and it will just take a little searching to find it. Even if they don't advertise it, we can probably ask and they may give us some free instance for testing. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8Qn4UiEo6GpanK2NTjR1fO1cADDnfks5tjr_wgaJpZM4TA8T4> .

martindurant · 2018-04-07T15:32:34Z

I'll be happy to help here if there's anything I can do.

alimanfoo · 2018-04-09T08:12:36Z

Thanks Martin.

…

On Sat, 7 Apr 2018, 16:32 Martin Durant, ***@***.***> wrote: I'll be happy to help here if there's anything I can do. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8QpxFmbL-_75sw-GnkRuj9G-UXzJtks5tmNwSgaJpZM4TA8T4> .

tjcrone · 2018-04-13T16:04:41Z

I tried installing with this commit using:

pip install git+https://github.com/rabernat/zarr@17c9f11073004689b0a505b2df0fd0c0437cad41

which worked, and installed as:

zarr                      2.1.5.dev478+dirty           <pip>

However when I tried running ds.to_zarr(), I got the following error:

Zarr version 2.2 or greater is required by xarray. See zarr installation http://zarr.readthedocs.io/en/stable/#installation

Any ideas on how to fix this? Thanks!

cc @friedrichknuth

jakirkham · 2018-04-13T16:46:07Z

Probably need to merge with latest master locally.

tjcrone · 2018-04-13T16:57:19Z

Okay thanks. I will go ahead and fork from @rabernat and merge master to test this.

tjcrone · 2018-04-13T18:52:29Z

Actually the right way to do this was to fork upstream and then add @rabernat's repo as a remote and merge his gcs_store branch. Git wranglin!

tjcrone · 2018-04-13T19:51:03Z

now when i try ds.to_zarr('pangeo-asdf2'), I get:

distributed.scheduler - ERROR - error from worker tcp://10.244.59.6:45973: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.67.6:36687: array not found at path 'video'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-823bd5f45788> in <module>()
----> 1 ds.to_zarr('pangeo-asdf2')

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding)
   1163         from ..backends.api import to_zarr
   1164         return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
-> 1165                        group=group, encoding=encoding)
   1166 
   1167     def __unicode__(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding)
    777     # I think zarr stores should always be sync'd immediately
    778     # TODO: figure out how to properly handle unlimited_dims
--> 779     dataset.dump_to_store(store, sync=True, encoding=encoding)
    780     return store

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in dump_to_store(self, store, encoder, sync, encoding, unlimited_dims)
   1068                     unlimited_dims=unlimited_dims)
   1069         if sync:
-> 1070             store.sync()
   1071 
   1072     def to_netcdf(self, path=None, mode='w', format=None, group=None,

/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in sync(self)
    365 
    366     def sync(self):
--> 367         self.writer.sync()
    368 
    369 

/opt/conda/lib/python3.6/site-packages/xarray/backends/common.py in sync(self)
    268         if self.sources:
    269             import dask.array as da
--> 270             da.store(self.sources, self.targets, lock=self.lock)
    271             self.sources = []
    272             self.targets = []

/opt/conda/lib/python3.6/site-packages/dask/array/core.py in store(sources, targets, lock, regions, compute, return_stored, **kwargs)
    953 
    954         if compute:
--> 955             result.compute(**kwargs)
    956             return None
    957         else:

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    153         dask.base.compute
    154         """
--> 155         (result,) = compute(self, traverse=False, **kwargs)
    156         return result
    157 

/opt/conda/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    402     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    403                     else (None, a) for a in args]
--> 404     results = get(dsk, keys, **kwargs)
    405     results_iter = iter(results)
    406     return tuple(a if f is None else f(next(results_iter), *a)

/opt/conda/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, **kwargs)
   2089             try:
   2090                 results = self.gather(packed, asynchronous=asynchronous,
-> 2091                                       direct=direct)
   2092             finally:
   2093                 for f in futures.values():

/opt/conda/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1501             return self.sync(self._gather, futures, errors=errors,
   1502                              direct=direct, local_worker=local_worker,
-> 1503                              asynchronous=asynchronous)
   1504 
   1505     @gen.coroutine

/opt/conda/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    613             return future
    614         else:
--> 615             return sync(self.loop, func, *args, **kwargs)
    616 
    617     def __repr__(self):

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    251             e.wait(10)
    252     if error[0]:
--> 253         six.reraise(*error[0])
    254     else:
    255         return result[0]

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/opt/conda/lib/python3.6/site-packages/distributed/utils.py in f()
    236             yield gen.moment
    237             thread_state.asynchronous = True
--> 238             result[0] = yield make_coro()
    239         except Exception as exc:
    240             error[0] = sys.exc_info()

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1097 
   1098                     try:
-> 1099                         value = future.result()
   1100                     except Exception:
   1101                         self.had_exception = True

/opt/conda/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1105                     if exc_info is not None:
   1106                         try:
-> 1107                             yielded = self.gen.throw(*exc_info)
   1108                         finally:
   1109                             # Break up a reference to itself

/opt/conda/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1379                             six.reraise(type(exception),
   1380                                         exception,
-> 1381                                         traceback)
   1382                     if errors == 'skip':
   1383                         bad_keys.add(key)

/opt/conda/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    690                 value = tp()
    691             if value.__traceback__ is not tb:
--> 692                 raise value.with_traceback(tb)
    693             raise value
    694         finally:

/opt/conda/lib/python3.6/site-packages/distributed/protocol/pickle.py in loads()
     57 def loads(x):
     58     try:
---> 59         return pickle.loads(x)
     60     except Exception:
     61         logger.info("Failed to deserialize %s", x[:10000], exc_info=True)

/opt/conda/lib/python3.6/site-packages/zarr/core.py in __setstate__()
   1928 
   1929     def __setstate__(self, state):
-> 1930         self.__init__(*state)
   1931 
   1932     def _synchronized_op(self, f, *args, **kwargs):

/opt/conda/lib/python3.6/site-packages/zarr/core.py in __init__()
    121 
    122         # initialize metadata
--> 123         self._load_metadata()
    124 
    125         # initialize attributes

/opt/conda/lib/python3.6/site-packages/zarr/core.py in _load_metadata()
    138         """(Re)load metadata from store."""
    139         if self._synchronizer is None:
--> 140             self._load_metadata_nosync()
    141         else:
    142             mkey = self._key_prefix + array_meta_key

/opt/conda/lib/python3.6/site-packages/zarr/core.py in _load_metadata_nosync()
    149             meta_bytes = self._store[mkey]
    150         except KeyError:
--> 151             err_array_not_found(self._path)
    152         else:
    153 

/opt/conda/lib/python3.6/site-packages/zarr/errors.py in err_array_not_found()
     23 
     24 def err_array_not_found(path):
---> 25     raise ValueError('array not found at path %r' % path)
     26 
     27 

ValueError: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.68.6:45095: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.64.6:36365: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.57.6:46707: array not found at path 'video'
distributed.scheduler - ERROR - error from worker tcp://10.244.69.6:43569: array not found at path 'video'

rabernat · 2018-04-13T19:57:48Z

@tjcrone: your error is unrelated to this PR. When you call ds.to_zarr('string'), you write to a zarr.DirectoryStore by default. You are getting an error because the distributed workers can't see the directory.

In order to use this store, you need to create a GCSStore directly, e.g.

gcsstore = zarr.storage.GCSStore('zarr-test', 'zarr-gcs-store', client_kwargs={'project': 'pangeo-181919'})
ds.to_zarr(gcsstore)

Note that the branch is highly experimental, work in progress, and consequently highly error prone.

tjcrone · 2018-04-13T20:13:52Z

Thanks @rabernat. I'll make sure to test this all locally inside the notebook server for now and not involve workers. When trying your suggestion,

gcsstore = zarr.storage.GCSStore('zarr-test', 'zarr-gcs-store', client_kwargs={'project': 'pangeo-181919'})

(and changing out for my project id), I got the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-22-27e581132598> in <module>()
----> 1 gcsstore = zarr.storage.GCSStore('pangeo-asdf', client_kwargs={'project': 'pangeo-198314'})

/opt/conda/lib/python3.6/site-packages/zarr/storage.py in __init__(self, bucket_name, prefix, client_kwargs)
   1941         self.prefix = normalize_storage_path(prefix)
   1942         self.client_kwargs = {}
-> 1943         self.initialize_bucket()
   1944 
   1945     def initialize_bucket(self):

/opt/conda/lib/python3.6/site-packages/zarr/storage.py in initialize_bucket(self)
   1946         from google.cloud import storage
   1947         # run `gcloud auth application-default login` from shell
-> 1948         client = storage.Client(**self.client_kwargs)
   1949         self.bucket = client.get_bucket(self.bucket_name)
   1950         # need to properly handle excpetions

/opt/conda/lib/python3.6/site-packages/google/cloud/storage/client.py in __init__(self, project, credentials, _http)
     57         self._base_connection = None
     58         super(Client, self).__init__(project=project, credentials=credentials,
---> 59                                      _http=_http)
     60         self._connection = Connection(self)
     61         self._batch_stack = _LocalStack()

/opt/conda/lib/python3.6/site-packages/google/cloud/client.py in __init__(self, project, credentials, _http)
    213 
    214     def __init__(self, project=None, credentials=None, _http=None):
--> 215         _ClientProjectMixin.__init__(self, project=project)
    216         Client.__init__(self, credentials=credentials, _http=_http)

/opt/conda/lib/python3.6/site-packages/google/cloud/client.py in __init__(self, project)
    169         project = self._determine_default(project)
    170         if project is None:
--> 171             raise EnvironmentError('Project was not passed and could not be '
    172                                    'determined from the environment.')
    173         if isinstance(project, six.binary_type):

OSError: Project was not passed and could not be determined from the environment.

I feel like the project was passed.

rabernat · 2018-04-13T20:18:43Z

zarr/storage.py

+
+        self.bucket_name = bucket_name
+        self.prefix = normalize_storage_path(prefix)
+        self.client_kwargs = {}


@tjcrone I guess because client_kwargs is not actually initialized properly!

tjcrone · 2018-04-13T21:39:17Z

Thanks @rabernat. This fix worked! Nice. The only comment I have at this stage is that when a group within the bucket already exists we get:

ValueError: path '' contains a group

which might be better if it was more explanatory and also better if it pulled the path into the error.

alimanfoo · 2018-04-13T21:53:09Z

If path is an empty string that means the root path within the store. That isn't obvious and could be improved. Zarr errs on the side of caution and will not overwrite an existing group or array by default. You can force it to overwrite by passing overwrite=True to most creation functions.

…

On Fri, 13 Apr 2018, 22:39 Tim Crone, ***@***.***> wrote: Thanks @rabernat <https://github.com/rabernat>. This fix worked! Nice. The only comment I have at this stage is that when a group within the bucket already exists we get: ValueError: path '' contains a group which might be better if it was more explanatory and also better if it pulled the path into the error. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8QqzGXSrbWxaNO3tiEnd7xqCGD61Uks5toRsFgaJpZM4TA8T4> .

tjcrone · 2018-04-14T00:19:52Z

I think it is worth noting that for this particular extension, when the path is not empty, it shows up as empty in the error:

ValueError: path '' contains a group

It is also worth noting that an indication that the path "already" contains a group would go a long way to making this error more helpful. Or saying that the group "already exists" also works.

alimanfoo · 2018-04-14T08:05:49Z

On Sat, 14 Apr 2018, 01:19 Tim Crone, ***@***.***> wrote: I think it is worth noting that for this particular extension, when the path is not empty, it shows up as empty in the error: ValueError: path '' contains a group Ah OK. Would you mind posting the code that generates this error, from the

point where you create the GCSStore.

It is also worth noting that an indication that the path "already" contains a group would go a long way to making this error more helpful. Or saying that the group "already exists" also works.

Yes that would be better. —

…

You are receiving this because you commented. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8QnRJsNJuYPlrx-PO5BkyitO3Ym9hks5toUCogaJpZM4TA8T4> .

tjcrone · 2018-04-14T10:46:01Z

Sorry for not including an example.

This code works fine when it runs the first time, as long as the rte-pangeo-data bucket exists (it will not create a bucket):

import zarr
gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314'})
ds.to_zarr(gcsstore)

But when run a second time it gives:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-f5593b801545> in <module>()
----> 1 ds.to_zarr(gcsstore)

/opt/conda/lib/python3.6/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding)
   1163         from ..backends.api import to_zarr
   1164         return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
-> 1165                        group=group, encoding=encoding)
   1166 
   1167     def __unicode__(self):

/opt/conda/lib/python3.6/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding)
    773     store = backends.ZarrStore.open_group(store=store, mode=mode,
    774                                           synchronizer=synchronizer,
--> 775                                           group=group, writer=None)
    776 
    777     # I think zarr stores should always be sync'd immediately

/opt/conda/lib/python3.6/site-packages/xarray/backends/zarr.py in open_group(cls, store, mode, synchronizer, group, writer)
    258                                       "#installation" % min_zarr)
    259         zarr_group = zarr.open_group(store=store, mode=mode,
--> 260                                      synchronizer=synchronizer, path=group)
    261         return cls(zarr_group, writer=writer)
    262 

/opt/conda/lib/python3.6/site-packages/zarr/hierarchy.py in open_group(store, mode, cache_attrs, synchronizer, path)
   1126             err_contains_array(path)
   1127         elif contains_group(store, path=path):
-> 1128             err_contains_group(path)
   1129         else:
   1130             init_group(store, path=path)

/opt/conda/lib/python3.6/site-packages/zarr/errors.py in err_contains_group(path)
     15 
     16 def err_contains_group(path):
---> 17     raise ValueError('path %r contains a group' % path)
     18 
     19 

ValueError: path '' contains a group

tjcrone · 2018-04-14T14:53:24Z

In case anyone wants to test this with multiple Dask workers, it is possible to create an Oauth credentials object using the following:

import google.auth
credentials, project = google.auth.default()

and then passing this credentials object when creating the gcsstore:

gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314', 'credentials': credentials})

For this to work as written it would be necessary to authenticate using the Google Cloud SDK:

gcloud auth application-default login --no-launch-browser

So far in all of my testing this code is working great!

rabernat · 2018-04-14T15:23:07Z

Can credentials be pickled?

…

Sent from my iPhone

On Apr 14, 2018, at 10:53 AM, Tim Crone ***@***.***> wrote: In case anyone wants to test this with multiple Dask workers, it is possible to create an Oauth credentials object using the following: import google.auth credentials, project = google.auth.default() and then passing this credentials object when creating the gcsstore: gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314', 'credentials': credentials}) For this to work as written it would be necessary to authenticate using the Google Cloud SDK: gcloud auth application-default login --no-launch-browser So far in all of my testing this code is working great! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

tjcrone · 2018-04-14T15:30:17Z

Interesting question. How would I test this? Note that this credentials object is secret, and allows a lot of access to my own GCP resources. I'm not sure I would want to pickle and distribute. There are ways of creating credentials with reduced permissions: https://google-auth.readthedocs.io/en/latest/user-guide.html.

jakirkham · 2018-08-10T20:50:19Z

FWIW opened issue ( https://github.com/zarr-developers/zarr/issues/290 ) to discuss cloud support generally. Please feel free to share anything relevant there.

alimanfoo

A couple of comments from experiences trying this out and looking at performance for retrieving small objects.

alimanfoo · 2018-09-06T21:52:20Z

zarr/storage.py

+    `default credentials <https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login>`_.
+    """
+
+    def __init__(self, bucket_name, prefix=None, client_kwargs={}):


Might be worth adding an option to use anonymous client. E.g., add an anonymous=False keyword argument, then make use of storage.Client.create_anonymous_client() when it comes to creating the client if user has provided anonymous=True.

alimanfoo · 2018-09-06T21:54:10Z

zarr/storage.py

+        from google.cloud import storage
+        # run `gcloud auth application-default login` from shell
+        client = storage.Client(**self.client_kwargs)
+        self.bucket = client.get_bucket(self.bucket_name)


Note that it's also possible to do:

self.bucket = storage.Bucket(client, name=self.bucket_name)

...which involves no network communication. Not sure this is a good idea in general as may want to retrieve the bucket info, but just mentioning.

alimanfoo · 2018-09-06T21:57:01Z

zarr/storage.py

+
+    def __getitem__(self, key):
+        blob_name = self.full_path(key)
+        blob = self.bucket.get_blob(blob_name)


An alternative here is to do:

from google.cloud import storage blob = storage.Blob(blob_name, self.bucket)

...which involves less network communication (profiling shows number of calls to method 'read' of '_ssl._SSLSocket' objects goes from 3 down to 1) and reduces the time to retrieve small objects by around 50%.

If this change was made, some rethinking of error handling may be needed, as the point at which a non-existing blob was detected might change.

alimanfoo · 2018-09-06T22:12:17Z

@rabernat just to say that, FWIW, I think this is worth pursuing. I know @martindurant has just added in some improvements to gcsfs to reduce latency and use checksums to verify content, which is great, and so both the performance and error reporting issues that have come up with gcsfs may be resolved. But while we are still gaining experience with GCS, also having a mapping implementation based on the google cloud client library I think would be valuable, so we can compare performance and see if issues replicate across both this and gcsfs or not. Obviously contingent on you (or someone else) having time and inclination, but I'm currently also starting to use google cloud storage and so would be happy to chip in.

rabernat · 2018-09-07T02:07:36Z

Great! I agree it is a good way forward.

Moving forward, I think a good question (related to #290) is whether we want to have a generic base class for object stores and then extend that for GCS, S3, ABS, etc. Maybe this is overcomplicating things though...

The reality is that I am teaching two classes this semester and am unlikely to have the time to dig into this deeply any time soon. You, @martindurant, @tjcrone, and @jakirkham are all clearly qualified to pick up where I left off here.

martindurant · 2018-09-07T03:29:00Z

Having a mapping class over generic file-system implementations was one of the points of fsspec, which will, of course, look rather familiar.

alimanfoo · 2018-09-08T08:33:19Z

Thanks @martindurant, FWIW I absolutely see the virtues of the fsspec approach. For where we are right now, from a purely pragmatic point of view, think it would be useful to have both gcsfs and the mapping implementation in this PR. We can compare performance, see if issues replicate with both, and compare how much code is needed and how it's organised. Also we have some very helpful contacts at Google genomics who are standing by to help with any issues that come up as we start to test at scale, so having a code path that uses Google's own client library will be useful I think. So I'm not trying to champion one approach over there other. Given that this PR seems very close to being complete, I think it's worth finishing it and releasing it as part of the Zarr experimental API. Long-term it might get deprecated in favour of gcsfs. All that said, my own time to work on Zarr is going to be quite limited for the rest of the year, so I don't know yet if I'll have time to work on this.

…

On Fri, 7 Sep 2018, 04:29 Martin Durant, ***@***.***> wrote: Having a mapping class over generic file-system implementations was one of the points of fsspec <https://github.com/martindurant/filesystem_spec/blob/master/fsspec/mapping.py>, which will, of course, look rather familiar. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8QjuH4zorgS80DStcgHiZfsXwGpnEks5uYef8gaJpZM4TA8T4> .

alimanfoo · 2018-10-04T21:55:20Z

Btw it looks like there is no local emulator for GCS, and this is an open issue for the google cloud Python client library: https://github.com/googleapis/google-cloud-python/issues/4840 (see also googleapis/google-cloud-python#4897).

martindurant · 2018-10-04T21:59:17Z

@alimanfoo - well aware of this, and had to jump through a number of uncomfortable hoops to test gcsfs using vcrpy (which can record and mock any urllib calls, but not easily). moto3 and azurite really help for s3 and azure-datalake/blob in this respect.

alimanfoo · 2018-10-04T23:52:24Z

@martindurant thanks, yes one of your comments was what prompted me to do a bit of digging. I guess it might be worth trying to put a bit of pressure on Google folks, it doesn't look like they've prioritised this highly.

jakirkham · 2019-03-02T01:45:34Z

Is this still interesting, @rabernat? FWIW I'd be +1 on getting this integrated. Probably still some things to address before merging though.

alimanfoo · 2019-03-02T08:45:15Z

Btw I was recently looking at the API docs for the Google cloud storage SDK and I believe there is support for local emulation now, which could make unit testing much easier.

…

On Sat, 2 Mar 2019, 01:45 jakirkham, ***@***.***> wrote: Is this still interesting, @rabernat <https://github.com/rabernat>? FWIW I'd be +1 on getting this integrated. Probably still some things to address before merging though. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8Qj3tBXnrV2WB-IECtDB4qJfFWneVks5vSde-gaJpZM4TA8T4> .

martindurant · 2019-03-02T13:25:27Z

there is support for local emulation

Where did you see that? It would make testing gcsfs much easier!

alimanfoo · 2019-03-25T09:09:32Z

I knew I should have posted a link, I thought it would be easy to find again but I can't seem to find anything now, must have dreamt it! Sorry for the confusion.

…

On Sat, 2 Mar 2019 at 13:25, Martin Durant ***@***.***> wrote: there is support for local emulation Where did you see that? It would make testing gcsfs *much* easier! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#252 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAq8QuBm5w6PtFXr04lsZORNMwaK5jxKks5vSnvIgaJpZM4TA8T4> .

-- Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health Big Data Institute Li Ka Shing Centre for Health Information and Discovery University of Oxford Old Road Campus Headington Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 or +44 (0)7866 541624 Email: alimanfoo@googlemail.com Web: http://a <http://purl.org/net/aliman>limanfoo.github.io/ Twitter: @alimanfoo <https://twitter.com/alimanfoo> Please feel free to resend your email and/or contact me by other means if you need an urgent reply.

rabernat · 2019-03-25T13:48:58Z

Before adding any new storage classes, I think we should address #414 and #301. The storage module is already super long and, at this point, pretty random. We have direct support for Azure and Redis but not S3?

I basically stopped working on this because I wanted to see how filesystemspec evolved. Maybe now is the time to assess whether we want to continue implementing more storage classes or add filesystemspec as a dependency. Someone also recently posted a link to another key/value abstraction we could look at, but I can't find the link.

This is also tied to the spec discussion. As we start dumping zarr in to more and more stores, how do we ensure that these will be readable by all zarr implementations?

My view is that we need more explicit specs for file and cloud storage that will allow other libraries to implement them in a compatible way.

martindurant · 2019-03-25T19:55:54Z

Should we discuss this explicitly at the next zarr meeting? I think a little effort should be spent integrating gcsfs, s3fs and adlfs (or blob) completely with fsspec to complete the picture.

jakirkham · 2019-03-25T22:03:53Z

What follows are merely my opinions, please feel free to agree or disagree with them as you see fit.

Refactoring out the existing storage layers into an independent library is probably already a worthwhile endeavor. There seem to be other people out there looking for or trying similar things. So it would be useful to engage them at that level. It will probably also make it easier to handle optional dependencies as was needed recently for Azure.

So S3 already works without any effort on our end. Namely s3fs provides S3Map, which we are able to use as noted in the tutorial. Though am now noticing there is GCSMap, which is maybe already sufficient? Though maybe you have tried this already. If so, what issues did you encounter?

Personally I'm not convinced that Zarr needs to use filesystem_spec. That isn't to say we shouldn't allow people the option or make sure things work in case people would like to use it. Am just not thinking it needs to be a required dependency of Zarr.

Are you thinking of simplekv? Posted that in the refactoring issue ( #414 ). They definitely have some similarities to us. Though it also has a few extra features we probably don't need, but it is a closer match to what we have. Probably worthwhile to work with them on a solution if they're interested.

It's reasonable to be concerned about spec impact. Though I think the key-value stores are the least of our concerns as they are already well-articulated and unlikely to change (unless we start adding links and references). The concerning part from the spec point of view is how we handle translations of data to key-value pairs in these stores. For example object types are something that already is an obstacle to compatibility that we will need to figure out. There are a few other examples as well.

martindurant · 2019-03-26T13:49:02Z

So S3 already works without any effort on our end. Namely s3fs provides S3Map, which we are able to use as noted in the tutorial. Though am now noticing there is GCSMap, which is maybe already sufficient? Though maybe you have tried this already. If so, what issues did you encounter?

S3Map and GCSMap are currently used for accessing zarr (which is why I was not convinced of the need for a google mapping implementation or in fact azure blob). Indeed, some access patterns were changed in gcsfs specifically because of zarr. This wholly substandard function exists for passing URL to Dask and picking the right mapper for zarr loading/saving. Taking that out of the hands of Dask without replicating it zarr or elsewhere, is exactly the sort of thing fsspec is for. It would always be optional, though, just as the current file-system backends are optional: the user would always be allowed to use their own mapping-compatible store.

joshmoore · 2021-11-23T14:18:41Z

@rabernat, is there anything that needs resurrecting from this? or safe to close?

rabernat · 2021-11-23T16:03:36Z

Definitely safe to close. Gcsfs meets all our needs here.

rabernat added 2 commits March 29, 2018 15:05

basic implementation working

86a91f5

docs and cleanup

17c9f11

fixed client_kwargs bug

18cd824

rabernat changed the title ~~google cloud storage class~~ WIP: google cloud storage class Mar 30, 2018

rabernat commented Apr 13, 2018

View reviewed changes

tjcrone mentioned this pull request Apr 14, 2018

Writing to FUSE mount for workers pangeo-data/pangeo#215

Closed

tjcrone referenced this pull request in shikharsg/zarr Aug 16, 2018

fixed pickle tests

b8f60fe

rabernat mentioned this pull request Sep 6, 2018

intermittent errors during blosc decompression of zarr chunks on pangeo.pydata.org pangeo-data/pangeo#196

Closed

alimanfoo mentioned this pull request Sep 6, 2018

Zarr/GCS potential optimisation to reduce latency pangeo-data/pangeo#381

Closed

alimanfoo reviewed Sep 6, 2018

View reviewed changes

jhamman mentioned this pull request Sep 10, 2018

cloud optimized netCDF and zarr ESIPFed/NUMfocusFallDev#4

Open

alimanfoo mentioned this pull request Sep 20, 2018

Add MongoDB storage backend? #299

Closed

rabernat mentioned this pull request Sep 23, 2018

Refactor storage around abstract file system? #301

Open

rabernat mentioned this pull request Dec 28, 2018

HTTP Store #373

Closed

7 tasks

alimanfoo mentioned this pull request Jan 31, 2019

S3 anonymous Zarr data examples... #385

Open

jakirkham mentioned this pull request Jul 3, 2019

Cloud Storage zarr-developers/community#13

Open

skgbanga mentioned this pull request Sep 28, 2020

Bulk write speed using zarr and GCS #619

Closed

rabernat closed this Nov 23, 2021

WIP: google cloud storage class #252

WIP: google cloud storage class #252

Conversation

rabernat commented Mar 29, 2018

jakirkham commented Mar 29, 2018

kalvdans commented Mar 29, 2018

alimanfoo commented Mar 29, 2018

rabernat commented Mar 30, 2018

alimanfoo commented Mar 30, 2018 via email

jakirkham commented Mar 30, 2018

alimanfoo commented Apr 2, 2018 via email

martindurant commented Apr 7, 2018

alimanfoo commented Apr 9, 2018 via email

tjcrone commented Apr 13, 2018

jakirkham commented Apr 13, 2018

tjcrone commented Apr 13, 2018

tjcrone commented Apr 13, 2018

tjcrone commented Apr 13, 2018

rabernat commented Apr 13, 2018

tjcrone commented Apr 13, 2018

rabernat Apr 13, 2018

Choose a reason for hiding this comment

tjcrone commented Apr 13, 2018

alimanfoo commented Apr 13, 2018 via email

tjcrone commented Apr 14, 2018

alimanfoo commented Apr 14, 2018 via email

tjcrone commented Apr 14, 2018

tjcrone commented Apr 14, 2018

rabernat commented Apr 14, 2018 via email

tjcrone commented Apr 14, 2018

jakirkham commented Aug 10, 2018

alimanfoo left a comment

Choose a reason for hiding this comment

alimanfoo Sep 6, 2018

Choose a reason for hiding this comment

alimanfoo Sep 6, 2018

Choose a reason for hiding this comment

alimanfoo Sep 6, 2018

Choose a reason for hiding this comment

alimanfoo commented Sep 6, 2018

rabernat commented Sep 7, 2018

martindurant commented Sep 7, 2018

alimanfoo commented Sep 8, 2018 via email

alimanfoo commented Oct 4, 2018

martindurant commented Oct 4, 2018

alimanfoo commented Oct 4, 2018

jakirkham commented Mar 2, 2019

alimanfoo commented Mar 2, 2019 via email

martindurant commented Mar 2, 2019

alimanfoo commented Mar 25, 2019 via email

rabernat commented Mar 25, 2019 • edited

martindurant commented Mar 25, 2019

jakirkham commented Mar 25, 2019

martindurant commented Mar 26, 2019 • edited

joshmoore commented Nov 23, 2021

rabernat commented Nov 23, 2021

rabernat commented Mar 25, 2019 •

edited

martindurant commented Mar 26, 2019 •

edited