Lightweight C++ and Python wrapper for the zarr and n5 file formats.
Support for the following compression codecs:
Conda packages for the relevant systems and python versions (except python2.7 on windows) are hosted on conda-forge:
$ conda install -c conda-forge z5py
The easiest way to build the library from source is using a conda-environment with all necessary dependencies.
You can find the conda environment files for the build environments of each supported python version on unix in environments/unix.
To set up the conda environment and install the package on Unix (for python 3.6):
$ conda env create -f environments/unix/36.yml
$ conda activate z5-py36
$ mkdir bld
$ cd bld
$ cmake -DWITH_ZLIB=ON -DWITH_BZIP2=ON -DCMAKE_INSTALL_PREFIX=/path/to/install ..
$ make install
Note that in the CMakeLists.txt, we try to infer the active conda-environment automatically.
If this fails, you can set it manually via -DCMAKE_PREFIX_PATH=/path/to/conda-env.
To specify where to install the package, set:
- CMAKE_INSTALL_PREFIX: where to install the C++ headers
- PYTHON_MODULE_INSTALL_DIR: where to install the python package (set to site-packages of the active conda env by default)
If you want to include z5 in another C++ project, note that the library itself is header-only. However, you need to link against the compression codecs that you use.
The Python API is very similar to h5py.
Some differences are:
- The constructor of File takes the boolean argument use_zarr_format, which determines whether the zarr or N5 format is used (if set to None, an attempt is made to automatically infer the format).
- There is no need to close File, hence the with block isn't necessary (but supported).
- Linked datasets (my_file['new_ds'] = my_file['old_ds']) are not supported.
- Broadcasting is only supported for scalars in Dataset.__setitem__.
- Arbitrary leading and trailing singleton dimensions can be added/removed/rolled through in Dataset.__setitem__.
- Compatibility of exception handling is a goal, but not necessarily guaranteed.
- Because zarr/N5 are usually used with large data, z5py compresses blocks by default where h5py does not. The default compressors are
  - zarr: "blosc"
  - n5: "gzip"
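The automatic format inference mentioned above can be pictured with a small sketch. It relies only on the on-disk conventions of the two formats (zarr v2 keeps its metadata in .zgroup / .zarray, n5 keeps it in attributes.json). The helper infer_is_zarr is hypothetical and not part of z5py; the actual inference in z5py may work differently.

```python
import os
import tempfile


def infer_is_zarr(path):
    """Guess the format of an existing container from its metadata files.

    Returns True for zarr, False for n5 and None if neither is found.
    Hypothetical helper for illustration, not part of z5py.
    """
    # zarr (v2) stores group / array metadata in '.zgroup' / '.zarray'
    if any(os.path.exists(os.path.join(path, name)) for name in ('.zgroup', '.zarray')):
        return True
    # n5 stores its metadata in 'attributes.json'
    if os.path.exists(os.path.join(path, 'attributes.json')):
        return False
    return None


# build three mock containers on disk and inspect them
with tempfile.TemporaryDirectory() as tmp:
    zarr_root = os.path.join(tmp, 'array.zr')
    n5_root = os.path.join(tmp, 'array.n5')
    empty_root = os.path.join(tmp, 'empty')
    for root in (zarr_root, n5_root, empty_root):
        os.makedirs(root)
    open(os.path.join(zarr_root, '.zgroup'), 'w').close()
    open(os.path.join(n5_root, 'attributes.json'), 'w').close()
    results = (infer_is_zarr(zarr_root), infer_is_zarr(n5_root), infer_is_zarr(empty_root))
    print(results)
```

For a new, still empty container no metadata exists yet, so nothing can be inferred. In that case it is safer to pass use_zarr_format explicitly.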
Some examples:
import z5py
import numpy as np
# create a file and a dataset
f = z5py.File('array.zr', use_zarr_format=True)
ds = f.create_dataset('data', shape=(1000, 1000), chunks=(100, 100), dtype='float32')
# write array to a roi
x = np.random.random_sample(size=(500, 500)).astype('float32')
ds[:500, :500] = x
# broadcast a scalar to a roi
ds[500:, 500:] = 42.
# read array from a roi
y = ds[250:750, 250:750]
# create a group and create a dataset in the group
g = f.create_group('local_group')
g.create_dataset('local_data', shape=(100, 100), chunks=(10, 10), dtype='uint32')
# open dataset from group or file
ds_local1 = f['local_group/local_data']
ds_local2 = g['local_data']
# read and write attributes
attributes = ds.attrs
attributes['foo'] = 'bar'
baz = attributes['foo']
There are convenience functions to convert n5 and zarr files to and from hdf5 or tif. Additional data formats will follow.
# convert existing h5 file to n5
# this only works if h5py is available
from z5py.converter import convert_from_h5
h5_file = '/path/to/file.h5'
n5_file = '/path/to/file.n5'
h5_key = n5_key = 'data'
target_chunks = (64, 64, 64)
n_threads = 8
convert_from_h5(h5_file, n5_file,
                in_path_in_file=h5_key,
                out_path_in_file=n5_key,
                chunks=target_chunks,
                n_threads=n_threads,
                compression='gzip')
The library is intended to be used with a multiarray that holds the data in memory.
By default, xtensor is used.
See https://github.com/constantinpape/z5/blob/master/include/z5/multiarray/xtensor_access.hxx.
There also exists an interface for marray.
See https://github.com/constantinpape/z5/blob/master/include/z5/multiarray/marray_access.hxx.
To interface with other multiarray implementations, reimplement readSubarray and writeSubarray.
Pull requests for additional multiarray support are welcome.
Some examples:
#include <vector>
#include "xtensor/xarray.hpp"
#include "z5/dataset_factory.hxx"
#include "z5/multiarray/xtensor_access.hxx"
#include "json.hpp"
int main() {
    // create a new zarr dataset
    std::vector<size_t> shape = {1000, 1000, 1000};
    std::vector<size_t> chunks = {100, 100, 100};
    bool asZarr = true;
    auto ds = z5::createDataset("ds.zr", "float32", shape, chunks, asZarr);

    // write array to roi
    std::vector<size_t> offset1 = {50, 100, 150};
    std::vector<size_t> shape1 = {150, 200, 100};
    xt::xarray<float> array1(shape1, 42.);
    z5::multiarray::writeSubarray(ds, array1, offset1.begin());

    // read array from roi (values that were not written before are filled with a fill-value)
    std::vector<size_t> offset2 = {100, 100, 100};
    std::vector<size_t> shape2 = {300, 200, 75};
    xt::xarray<float> array2(shape2);
    z5::multiarray::readSubarray(ds, array2, offset2.begin());

    // read and write json attributes
    nlohmann::json attributesIn;
    attributesIn["bar"] = "foo";
    attributesIn["pi"] = 3.141593;
    z5::writeAttributes(ds->handle(), attributesIn);
    nlohmann::json attributesOut;
    z5::readAttributes(ds->handle(), attributesOut);

    return 0;
}
This library implements the zarr and n5 data specifications in C++ and Python. Use it if you need access to these formats from these languages. Zarr and n5 have native implementations in Python and Java, respectively. If you only need access in the respective native language, it is recommended to use these implementations, which are more thoroughly tested.
- No thread / process synchronization -> writing to the same chunk in parallel will lead to undefined behavior.
- Supports only little endianness and C-order for the zarr format.
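One way to stay clear of the missing synchronization is to hand every worker a set of regions that are aligned to the chunk grid, so that no two workers ever touch the same chunk. The following is a minimal sketch under that assumption; chunk_aligned_blocks is a hypothetical helper, not part of z5py.

```python
from itertools import product


def chunk_aligned_blocks(shape, chunks):
    """Yield one tuple of slices per chunk; the blocks never overlap."""
    # start coordinates of all chunks along each axis
    ranges = [range(0, s, c) for s, c in zip(shape, chunks)]
    for starts in product(*ranges):
        # clip the last block along each axis to the dataset border
        yield tuple(slice(b, min(b + c, s))
                    for b, c, s in zip(starts, chunks, shape))


blocks = list(chunk_aligned_blocks(shape=(1000, 1000), chunks=(100, 100)))
print(len(blocks))   # one block per chunk
print(blocks[0])
```

Each block can then be processed by exactly one thread or process, for example by writing ds[block] = data inside the worker, provided the dataset was created with the same chunks that were passed to the helper.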
Internally, n5 uses column-major (i.e. x, y, z) axis ordering, while z5 uses row-major (i.e. z, y, x). While this is mostly handled internally, it means that the metadata does not transfer 1 to 1, but needs to be reversed for most shapes. Concretely:
| | n5 | z5 |
|---|---|---|
| Shape | s_x, s_y, s_z | s_z, s_y, s_x |
| Chunk-Shape | c_x, c_y, c_z | c_z, c_y, c_x |
| Chunk-Ids | i_x, i_y, i_z | i_z, i_y, i_x |
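In code, the conversion in the table amounts to reversing the tuple. A minimal sketch, with made-up example values and the hypothetical helper names n5_to_z5 and z5_to_n5:

```python
def n5_to_z5(values):
    """Reverse the axis order: (x, y, z) as stored by n5 -> (z, y, x) as used by z5."""
    return tuple(reversed(values))


# reversing twice restores the original order, so the same function works both ways
z5_to_n5 = n5_to_z5

n5_shape = (512, 256, 128)    # s_x, s_y, s_z
n5_chunks = (64, 64, 32)      # c_x, c_y, c_z
n5_chunk_id = (3, 1, 0)       # i_x, i_y, i_z

z5_shape = n5_to_z5(n5_shape)
z5_chunks = n5_to_z5(n5_chunks)
z5_chunk_id = n5_to_z5(n5_chunk_id)
print(z5_shape, z5_chunks, z5_chunk_id)
```

The same reversal applies to any per-axis metadata that is read directly from an n5 attributes.json written by another tool.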