Consolidate zarr metadata into single key #268

Merged: 39 commits, Nov 14, 2018
Commits (39):
8301fa6  POC of making a single file out of zarr dot files (Jun 26, 2018)
be6d706  (WIP) include simple code that would load metadata (Jul 2, 2018)
f1128ff  Implement ConsolidatedMetadataStore (Aug 2, 2018)
6666391  fix for py34 py35 (Aug 2, 2018)
a369073  improve coverage; data write in consolidated store (Aug 2, 2018)
96e1fb0  coverage (Aug 2, 2018)
7a5c81d  POC of making a single file out of zarr dot files (Jun 26, 2018)
0711920  (WIP) include simple code that would load metadata (Jul 2, 2018)
f8e6a2f  Implement ConsolidatedMetadataStore (Aug 2, 2018)
c4436c7  fix for py34 py35 (Aug 2, 2018)
0757a72  improve coverage; data write in consolidated store (Aug 2, 2018)
da3f6d7  coverage (Aug 2, 2018)
9e1c2c0  Merge branch 'consolidate_metadata' of https://github.com/martinduran… (Oct 18, 2018)
552a084  POC of making a single file out of zarr dot files (Jun 26, 2018)
8f3325f  (WIP) include simple code that would load metadata (Jul 2, 2018)
5da425f  Implement ConsolidatedMetadataStore (Aug 2, 2018)
e62d39c  fix for py34 py35 (Aug 2, 2018)
01e815a  improve coverage; data write in consolidated store (Aug 2, 2018)
1561ead  coverage (Aug 2, 2018)
03d1dbc  doc and param style (alimanfoo, Oct 18, 2018)
4e55548  add test for nchunks_initialized (alimanfoo, Oct 18, 2018)
c283487  expose chunk_store param in open* functions (alimanfoo, Oct 18, 2018)
cc9d7c7  implement open_consolidated (alimanfoo, Oct 18, 2018)
a14b045  tweaks to consolidated behaviour (alimanfoo, Oct 18, 2018)
0cbda15  py2 fix (alimanfoo, Oct 18, 2018)
e890015  Merge branch 'consolidate_metadata' of https://github.com/martinduran… (Oct 23, 2018)
b4b60aa  Update docstrings (Oct 23, 2018)
cae30da  added api docs; consistify references in docstrings (alimanfoo, Oct 31, 2018)
ba99cfa  add tests (alimanfoo, Nov 1, 2018)
6f01dec  Add section to tutorial, add to release notes (Nov 1, 2018)
89bde83  Merge branch 'master' into consolidate_metadata (Nov 1, 2018)
f5130ac  fix getsize test (alimanfoo, Nov 1, 2018)
9135ad1  Merge branch 'consolidate_metadata' of github.com:martindurant/zarr i… (alimanfoo, Nov 1, 2018)
3d3cb2f  add setuptools-scm to dev env so can go fully offline (alimanfoo, Nov 1, 2018)
8acf83a  fix requirements (alimanfoo, Nov 1, 2018)
2f89535  skip consolidate doctests; minor edits (alimanfoo, Nov 1, 2018)
c8ed0f6  fix refs [ci skip] (alimanfoo, Nov 1, 2018)
9c0c621  make consolidated metadata human-readable (alimanfoo, Nov 3, 2018)
ccef26c  comments [ci skip] (alimanfoo, Nov 6, 2018)
2 changes: 2 additions & 0 deletions docs/api/convenience.rst
@@ -10,3 +10,5 @@ Convenience functions (``zarr.convenience``)
.. autofunction:: copy_all
.. autofunction:: copy_store
.. autofunction:: tree
.. autofunction:: consolidate_metadata
.. autofunction:: open_consolidated
2 changes: 2 additions & 0 deletions docs/api/storage.rst
@@ -27,6 +27,8 @@ Storage (``zarr.storage``)
.. automethod:: invalidate_values
.. automethod:: invalidate_keys

.. autoclass:: ConsolidatedMetadataStore

.. autofunction:: init_array
.. autofunction:: init_group
.. autofunction:: contains_array
7 changes: 7 additions & 0 deletions docs/release.rst
@@ -9,6 +9,13 @@ Release notes
Enhancements
~~~~~~~~~~~~

* Add "consolidated" metadata as an experimental feature: use
:func:`zarr.convenience.consolidate_metadata` to copy all metadata from the various
metadata keys within a dataset hierarchy into a single key, and
:func:`zarr.convenience.open_consolidated` to make use of that key. This can greatly
reduce the number of calls to the storage backend, and so remove much of the overhead
of reading remote data. By :user:`Martin Durant <martindurant>`, :issue:`268`.

* Support has been added for structured arrays with sub-array shape and/or nested fields. By
:user:`Tarik Onalan <onalant>`, :issue:`111`, :issue:`296`.

54 changes: 47 additions & 7 deletions docs/tutorial.rst
@@ -778,9 +778,11 @@ chunk size, which will reduce the number of chunks and thus reduce the number of
round-trips required to retrieve data for an array (and thus reduce the impact of network
latency). Another option is to try to increase the compression ratio by changing
compression options or trying a different compressor (which will reduce the impact of
limited network bandwidth). As of version 2.2, Zarr also provides the
:class:`zarr.storage.LRUStoreCache` which can be used to implement a local in-memory cache
layer over a remote store. E.g.::
limited network bandwidth).

As of version 2.2, Zarr also provides the :class:`zarr.storage.LRUStoreCache`
which can be used to implement a local in-memory cache layer over a remote
store. E.g.::

>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
@@ -797,13 +799,51 @@
b'Hello from the cloud!'
0.0009490990014455747

If you are still experiencing poor performance with distributed/cloud storage, please
raise an issue on the GitHub issue tracker with any profiling data you can provide, as
there may be opportunities to optimise further either within Zarr or within the mapping
interface to the storage.
If you are still experiencing poor performance with distributed/cloud storage,
please raise an issue on the GitHub issue tracker with any profiling data you
can provide, as there may be opportunities to optimise further either within
Zarr or within the mapping interface to the storage.

.. _tutorial_copy:

Consolidating metadata
~~~~~~~~~~~~~~~~~~~~~~

(This is an experimental feature.)

Since there is a significant overhead for every connection to a cloud object
store such as S3, the pattern described in the previous section may incur
substantial latency while scanning the metadata of the dataset hierarchy, even
though each individual metadata object is small. In such cases, once the data
are static and can be regarded as read-only (at least as far as the
metadata/structure of the dataset hierarchy is concerned), the many metadata
objects can be consolidated into a single one via
:func:`zarr.convenience.consolidate_metadata`. Doing this can greatly increase
the speed of reading the dataset metadata, e.g.::

>>> zarr.consolidate_metadata(store) # doctest: +SKIP

This creates a special key with a copy of all of the metadata from all of the
metadata objects in the store.
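
Because the consolidated metadata is stored as JSON under an ordinary key, it
can also be inspected directly. For illustration only (a sketch: the exact
entries depend on the hierarchy that was consolidated, and ``foo`` is a
hypothetical array)::

>>> import json
>>> meta = json.loads(store['.zmetadata'])  # doctest: +SKIP
>>> meta['zarr_consolidated_format']  # doctest: +SKIP
1
>>> sorted(meta['metadata'])  # doctest: +SKIP
['.zgroup', 'foo/.zarray', 'foo/.zattrs']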

Later, to open a Zarr store with consolidated metadata, use
:func:`zarr.convenience.open_consolidated`, e.g.::

>>> root = zarr.open_consolidated(store) # doctest: +SKIP

This uses the special key to read all of the metadata in a single call to the
backend storage.

Note that the hierarchy could still be opened in the normal way and altered,
causing the consolidated metadata to fall out of sync with the real state of
the dataset hierarchy. In that case,
:func:`zarr.convenience.consolidate_metadata` would need to be called again.

To protect against consolidated metadata accidentally getting out of sync, the
root group returned by :func:`zarr.convenience.open_consolidated` is read-only
for the metadata, meaning that no new groups or arrays can be created, and
arrays cannot be resized. However, data values within arrays can still be updated.
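
For example, with a hypothetical array ``foo`` in the consolidated hierarchy,
updating data succeeds while operations that would change metadata fail (a
sketch)::

>>> root = zarr.open_consolidated(store, mode='r+')  # doctest: +SKIP
>>> z = root['foo']  # doctest: +SKIP
>>> z[:] = 42  # OK: data values can still be written  # doctest: +SKIP
>>> z.resize(10000)  # raises an error: resizing would alter metadata  # doctest: +SKIP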

Copying/migrating data
----------------------

1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -46,6 +46,7 @@ python-dateutil==2.7.3
readme-renderer==22.0
requests==2.19.1
requests-toolbelt==0.8.0
setuptools-scm==3.1.0
s3fs==0.1.6
s3transfer==0.1.13
scandir==1.9.0
3 changes: 2 additions & 1 deletion zarr/__init__.py
@@ -12,6 +12,7 @@
from zarr.sync import ThreadSynchronizer, ProcessSynchronizer
from zarr.codecs import *
from zarr.convenience import (open, save, save_array, save_group, load, copy_store,
copy, copy_all, tree)
copy, copy_all, tree, consolidate_metadata,
open_consolidated)
from zarr.errors import CopyError, MetadataError, PermissionError
from zarr.version import version as __version__
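
With this change, both new helpers become importable from the top-level
namespace. For example (a sketch, assuming ``store`` already contains a Zarr
hierarchy)::

>>> import zarr
>>> zarr.consolidate_metadata(store)  # doctest: +SKIP
>>> root = zarr.open_consolidated(store)  # doctest: +SKIP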
4 changes: 2 additions & 2 deletions zarr/attrs.py
@@ -4,8 +4,8 @@
from collections import MutableMapping


from zarr.compat import text_type
from zarr.errors import PermissionError
from zarr.meta import parse_metadata


class Attributes(MutableMapping):
@@ -43,7 +43,7 @@ def _get_nosync(self):
except KeyError:
d = dict()
else:
d = json.loads(text_type(data, 'ascii'))
d = parse_metadata(data)
return d

def asdict(self):
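
The ``parse_metadata`` helper lives in ``zarr.meta`` and is not part of this
section's diff. A minimal sketch of the behaviour relied upon here, assuming
it must accept either raw JSON bytes (as returned by a normal store) or an
already-parsed mapping (as served by a consolidated metadata store)::

import json

from zarr.compat import Mapping, text_type


def parse_metadata(s):
    # sketch only -- see zarr.meta for the actual implementation
    if isinstance(s, Mapping):
        # a consolidated metadata store may return metadata that has
        # already been parsed into a dict
        return s
    # a normal store returns metadata as raw JSON bytes
    return json.loads(text_type(s, 'ascii'))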
4 changes: 4 additions & 0 deletions zarr/compat.py
@@ -19,6 +19,8 @@ class PermissionError(Exception):
def OrderedDict_move_to_end(od, key):
od[key] = od.pop(key)

from collections import Mapping


else: # pragma: py2 no cover

@@ -29,3 +31,5 @@

def OrderedDict_move_to_end(od, key):
od.move_to_end(key)

from collections.abc import Mapping
126 changes: 120 additions & 6 deletions zarr/convenience.py
@@ -15,28 +15,34 @@
from zarr.errors import err_path_not_found, CopyError
from zarr.util import normalize_storage_path, TreeViewer, buffer_size
from zarr.compat import PY2, text_type
from zarr.meta import ensure_str, json_dumps


# noinspection PyShadowingBuiltins
def open(store, mode='a', **kwargs):
def open(store=None, mode='a', **kwargs):
"""Convenience function to open a group or array using file-mode-like semantics.

Parameters
----------
store : MutableMapping or string
store : MutableMapping or string, optional
Store or path to directory in file system or name of zip file.
mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
read/write (must exist); 'a' means read/write (create if doesn't
exist); 'w' means create (overwrite if exists); 'w-' means create
(fail if exists).
**kwargs
Additional parameters are passed through to :func:`zarr.open_array` or
:func:`zarr.open_group`.
Additional parameters are passed through to :func:`zarr.creation.open_array` or
:func:`zarr.hierarchy.open_group`.

Returns
-------
z : :class:`zarr.core.Array` or :class:`zarr.hierarchy.Group`
Array or group, depending on what exists in the given store.

See Also
--------
zarr.open_array, zarr.open_group
zarr.creation.open_array, zarr.hierarchy.open_group

Examples
--------
@@ -68,7 +74,8 @@ def open(store, mode='a', **kwargs):

path = kwargs.get('path', None)
# handle polymorphic store arg
store = normalize_store_arg(store, clobber=(mode == 'w'))
clobber = mode == 'w'
store = normalize_store_arg(store, clobber=clobber)
path = normalize_storage_path(path)

if mode in {'w', 'w-', 'x'}:
@@ -1069,3 +1076,110 @@ def copy_all(source, dest, shallow=False, without_attrs=False, log=None,
_log_copy_summary(log, dry_run, n_copied, n_skipped, n_bytes_copied)

return n_copied, n_skipped, n_bytes_copied


def consolidate_metadata(store, metadata_key='.zmetadata'):
"""
Consolidate all metadata for groups and arrays within the given store
into a single resource and put it under the given key.

This produces a single object in the backend store, containing all the
metadata read from all the zarr-related keys that can be found. After
metadata have been consolidated, use :func:`open_consolidated` to open
the root group in optimised, read-only mode, using the consolidated
metadata to reduce the number of read operations on the backend store.

Note that if the metadata in the store is changed after this
consolidation, then the metadata read by :func:`open_consolidated`
would be incorrect unless this function is called again.

.. note:: This is an experimental feature.

Parameters
----------
store : MutableMapping or string
Store or path to directory in file system or name of zip file.
metadata_key : str
Key to put the consolidated metadata under.

Returns
-------
g : :class:`zarr.hierarchy.Group`
Group instance, opened with the new consolidated metadata.

See Also
--------
open_consolidated

"""
import json

store = normalize_store_arg(store)

def is_zarr_key(key):
return (key.endswith('.zarray') or key.endswith('.zgroup') or
key.endswith('.zattrs'))

out = {
'zarr_consolidated_format': 1,
'metadata': {
key: json.loads(ensure_str(store[key]))
for key in store if is_zarr_key(key)
}
}
store[metadata_key] = json_dumps(out).encode()
return open_consolidated(store, metadata_key=metadata_key)
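
For illustration, the object written under ``.zmetadata`` would look something
like the following (a sketch for a hypothetical hierarchy containing a single
array ``foo``; most of the ``.zarray`` fields are omitted here)::

{
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "foo/.zarray": {"shape": [10000], "chunks": [1000], "zarr_format": 2},
        "foo/.zattrs": {"description": "a hypothetical array"}
    }
}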


def open_consolidated(store, metadata_key='.zmetadata', mode='r+'):
"""Open group using metadata previously consolidated into a single key.

This is an optimised method for opening a Zarr group, where instead of
traversing the group/array hierarchy by accessing the metadata keys at
each level, a single key contains all of the metadata for everything. This
can be especially beneficial for remote data sources, where the overhead of
accessing a key is large compared to the time to read data.

The group accessed must have already had its metadata consolidated into a
single key using the function :func:`consolidate_metadata`.

This optimised method only works in modes which do not change the
metadata, although the data may still be written/updated.

Parameters
----------
store : MutableMapping or string
Store or path to directory in file system or name of zip file.
metadata_key : str
Key to read the consolidated metadata from. The default (.zmetadata)
corresponds to the default used by :func:`consolidate_metadata`.
mode : {'r', 'r+'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
read/write (must exist), although only writes to data are allowed;
changes to metadata, including creation of new arrays or groups,
are not allowed.

Returns
-------
g : :class:`zarr.hierarchy.Group`
Group instance, opened with the consolidated metadata.

See Also
--------
consolidate_metadata

"""

from .storage import ConsolidatedMetadataStore

# normalize parameters
store = normalize_store_arg(store)
if mode not in {'r', 'r+'}:
raise ValueError("invalid mode, expected either 'r' or 'r+'; found {!r}"
.format(mode))

# setup metadata store
meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)

# pass through
return open(store=meta_store, chunk_store=store, mode=mode)
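
The ``ConsolidatedMetadataStore`` class used above is defined in
``zarr.storage`` and does not appear in this section. A minimal sketch of the
idea, under stated assumptions about the real implementation: read the
consolidated key once, serve every metadata lookup from memory, and refuse
writes (chunk reads and writes bypass it entirely, since ``open`` above routes
them to the original store via ``chunk_store``)::

import json

from zarr.compat import Mapping
from zarr.errors import PermissionError
from zarr.meta import ensure_str


class ConsolidatedMetadataStoreSketch(Mapping):
    """Illustrative sketch only -- not the actual zarr.storage class."""

    def __init__(self, store, metadata_key='.zmetadata'):
        self.store = store
        # a single read against the backend retrieves all metadata
        consolidated = json.loads(ensure_str(store[metadata_key]))
        assert consolidated['zarr_consolidated_format'] == 1
        self.meta_store = consolidated['metadata']

    def __getitem__(self, key):
        # served from memory: no call to the backend store; note the
        # value is an already-parsed dict, which is why zarr.attrs now
        # routes reads through parse_metadata (see above)
        return self.meta_store[key]

    def __iter__(self):
        return iter(self.meta_store)

    def __len__(self):
        return len(self.meta_store)

    def __setitem__(self, key, value):
        # metadata is read-only in this mode
        raise PermissionError('consolidated metadata is read-only')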
3 changes: 3 additions & 0 deletions zarr/core.py
@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
if config is None:
self._compressor = None
else:
# temporary workaround for
# https://github.com/zarr-developers/numcodecs/issues/78
config = dict(config)
[Inline review comment from a project member]: Reverting in PR (#361) as this was fixed in Numcodecs 0.6.0 with PR (zarr-developers/numcodecs#79). As we now require Numcodecs 0.6.0+ in Zarr, we get the fix and thus no longer need the workaround.

self._compressor = get_codec(config)

# setup filters
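
A sketch of the failure mode this workaround guards against (assuming, per the
linked issue, that older Numcodecs mutated the mapping passed to
``get_codec``)::

>>> config = {'id': 'zlib', 'level': 1}
>>> codec = get_codec(config)  # older numcodecs popped 'id' here  # doctest: +SKIP
>>> config  # doctest: +SKIP
{'level': 1}

With consolidated metadata, the same ``config`` dict is served from memory on
every metadata read, so without the defensive copy a subsequent read would no
longer find the codec ``'id'``.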
26 changes: 16 additions & 10 deletions zarr/creation.py
@@ -346,15 +346,15 @@ def array(data, **kwargs):
return z


def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor='default',
fill_value=0, order='C', synchronizer=None, filters=None,
cache_metadata=True, cache_attrs=True, path=None, object_codec=None,
**kwargs):
def open_array(store=None, mode='a', shape=None, chunks=True, dtype=None,
compressor='default', fill_value=0, order='C', synchronizer=None,
filters=None, cache_metadata=True, cache_attrs=True, path=None,
object_codec=None, chunk_store=None, **kwargs):
"""Open an array using file-mode-like semantics.

Parameters
----------
store : MutableMapping or string
store : MutableMapping or string, optional
Store or path to directory in file system or name of zip file.
mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
@@ -391,6 +391,8 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
Array path within store.
object_codec : Codec, optional
A codec to encode object arrays, only needed if dtype=object.
chunk_store : MutableMapping or string, optional
Separate storage for chunk data: store or path to directory in file
system or name of zip file. If not given, ``store`` is used for both
metadata and chunks.

Returns
-------
@@ -426,7 +428,10 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
# a : read/write if exists, create otherwise (default)

# handle polymorphic store arg
store = normalize_store_arg(store, clobber=(mode == 'w'))
clobber = mode == 'w'
store = normalize_store_arg(store, clobber=clobber)
if chunk_store is not None:
chunk_store = normalize_store_arg(chunk_store, clobber=clobber)
path = normalize_storage_path(path)

# API compatibility with h5py
@@ -448,7 +453,7 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, overwrite=True, path=path,
object_codec=object_codec)
object_codec=object_codec, chunk_store=chunk_store)

elif mode == 'a':
if contains_group(store, path=path):
@@ -457,7 +462,7 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, path=path,
object_codec=object_codec)
object_codec=object_codec, chunk_store=chunk_store)

elif mode in ['w-', 'x']:
if contains_group(store, path=path):
@@ -468,14 +473,15 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, path=path,
object_codec=object_codec)
object_codec=object_codec, chunk_store=chunk_store)

# determine read only status
read_only = mode == 'r'

# instantiate array
z = Array(store, read_only=read_only, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path)
cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
chunk_store=chunk_store)

return z

Expand Down
Loading