Add zarr simulation output store (#102)
* rename transposed_vars -> transposed_dims

* fix process fixture

* wip add zarr output store and refactor [skip_ci]

* fix existing tests + some optimization

* add zarr as a required dependency

* remove unused arguments

* add xarray accessor API entry point

* set better default fill values for float dtypes

* wip add tests

* doc: add zarr to intersphinx

* refactor after #103

* compress also when using default in-memory store

* minor tweaks and fixes

* more tests

* let zarr compute chunk sizes for now

* default in-memory store: force loading all data

* doc: add user guide section on model io storage

* black

* update release notes

* ci: update doc requirements

* ci: add dask

* don't lazily load scalar data variables

* user guide tweaks
benbovy committed Mar 3, 2020
1 parent d5db973 commit 63b441d
Showing 19 changed files with 553 additions and 215 deletions.
3 changes: 3 additions & 0 deletions ci/requirements/doc.yml
@@ -7,10 +7,13 @@ dependencies:
- numpy=1.17.2
- pandas=0.25.1
- xarray=0.13.0
- netcdf4=1.5.3
- dask=2.11.0
- ipython=7.8.0
- matplotlib=3.0.2
- graphviz
- python-graphviz
- nbconvert=5.6.0
- sphinx=1.8.5
- pandoc=2.7.3
- zarr=2.4.0
2 changes: 2 additions & 0 deletions ci/requirements/py36.yml
@@ -7,9 +7,11 @@ dependencies:
- pytest
- numpy
- xarray
- dask
- graphviz
- python-graphviz
- ipython
- zarr
- pip:
- coveralls
- pytest-cov
2 changes: 2 additions & 0 deletions ci/requirements/py37-attrs-dev.yml
@@ -6,6 +6,8 @@ dependencies:
- pytest
- numpy
- xarray
- dask
- zarr
- pip:
- git+https://github.com/python-attrs/attrs.git
- coveralls
2 changes: 2 additions & 0 deletions ci/requirements/py37-xarray-dev.yml
@@ -6,6 +6,8 @@ dependencies:
- python=3.7
- pytest
- numpy
- zarr
- dask
- pip:
- git+https://github.com/pydata/xarray.git
- coveralls
2 changes: 2 additions & 0 deletions ci/requirements/py37.yml
@@ -7,9 +7,11 @@ dependencies:
- pytest
- numpy
- xarray
- dask
- graphviz
- python-graphviz
- ipython
- zarr
- pip:
- coveralls
- pytest-cov
2 changes: 2 additions & 0 deletions ci/requirements/py38.yml
@@ -7,8 +7,10 @@ dependencies:
- pytest
- numpy
- xarray
- dask
- graphviz
- python-graphviz
- zarr
- pip:
- coveralls
- pytest-cov
1 change: 1 addition & 0 deletions doc/conf.py
@@ -213,6 +213,7 @@
"attr": ("http://www.attrs.org/en/stable/", None),
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
"xarray": ("https://xarray.pydata.org/en/stable/", None),
"zarr": ("https://zarr.readthedocs.io/en/stable/", None),
}


2 changes: 2 additions & 0 deletions doc/index.rst
@@ -35,6 +35,7 @@ Documentation index
* :doc:`create_model`
* :doc:`inspect_model`
* :doc:`run_model`
* :doc:`io_storage`
* :doc:`monitor`
* :doc:`testing`

@@ -47,6 +48,7 @@ Documentation index
create_model
inspect_model
run_model
io_storage
monitor
testing

1 change: 1 addition & 0 deletions doc/installing.rst
@@ -10,6 +10,7 @@ Required dependencies
- `attrs <http://www.attrs.org>`__ (18.2.0 or later)
- `numpy <http://www.numpy.org/>`__
- `xarray <http://xarray.pydata.org>`__ (0.10.0 or later)
- `zarr <https://zarr.readthedocs.io>`__ (2.3.0 or later)

Optional dependencies
---------------------
145 changes: 145 additions & 0 deletions doc/io_storage.rst
@@ -0,0 +1,145 @@
.. _io_storage:

Store Model Inputs and Outputs
==============================

Model inputs and/or outputs can be kept in memory or saved on disk using either
`xarray`_'s or `zarr`_'s I/O capabilities.

.. _xarray: http://xarray.pydata.org
.. _zarr: https://zarr.readthedocs.io/en/stable

.. ipython:: python
    :suppress:

    import sys
    sys.path.append('scripts')
    from advection_model import model2

Using xarray
------------

The :class:`xarray.Dataset` structure, used for both simulation inputs and
outputs, already supports serialization and I/O to several file formats, among
which netCDF_ is the recommended format. For more information, see section
`reading and writing files`_ in xarray's docs.

.. _netCDF: https://www.unidata.ucar.edu/software/netcdf/
.. _`reading and writing files`: http://xarray.pydata.org/en/stable/io.html

Before showing some examples, let's first create the same initial setup as the
one used in section :doc:`run_model`:

.. ipython:: python

    model2

.. ipython:: python

    import numpy as np
    import xsimlab as xs

    ds = xs.create_setup(
        model=model2,
        clocks={
            'time': np.linspace(0., 1., 101),
            'otime': [0, 0.5, 1]
        },
        master_clock='time',
        input_vars={
            'grid': {'length': 1.5, 'spacing': 0.01},
            'init': {'loc': 0.3, 'scale': 0.1},
            'advect__v': 1.
        },
        output_vars={
            'grid__x': None,
            'profile__u': 'otime'
        }
    )
    ds

You can save the dataset above, e.g., using :meth:`xarray.Dataset.to_netcdf`:

.. ipython:: python

    ds.to_netcdf("model2_setup.nc")

You can then reload this setup later or elsewhere before starting a new
simulation:

.. ipython:: python

    import xarray as xr
    in_ds = xr.load_dataset("model2_setup.nc")
    out_ds = in_ds.xsimlab.run(model=model2)
    out_ds

The latter dataset ``out_ds`` contains both the inputs and the outputs of this
model run. Likewise, you can write it to a netCDF file or any other format
supported by xarray, e.g.,

.. ipython:: python

    out_ds.to_netcdf("model2_run.nc")

Using zarr
----------

When :meth:`xarray.Dataset.xsimlab.run` is called, xarray-simlab uses the zarr_
library to efficiently store (i.e., with compression) both simulation input and
output data. The output data is stored progressively as the simulation proceeds.

By default, all of this data is kept in memory. For large amounts of model I/O
data, however, it is recommended to save the data on disk. For example, you can
specify a directory in which to store it:

.. ipython:: python

    out_ds = in_ds.xsimlab.run(model=model2, output_store="model2_run.zarr")

You can also store the data in a temporary directory:

.. ipython:: python

    import zarr
    out_ds = in_ds.xsimlab.run(model=model2, output_store=zarr.TempStore())

Or you can directly use :func:`zarr.group` for more options, e.g., if you want
to overwrite a directory that has been used for old model runs:

.. ipython:: python

    zg = zarr.group("model2_run.zarr", overwrite=True)
    out_ds = in_ds.xsimlab.run(model=model2, output_store=zg)

.. note::

   The zarr library provides many storage alternatives, including support for
   distributed/cloud and database storage systems (see `storage alternatives`_
   in zarr's tutorial). Note, however, that a few alternatives won't work well
   with xarray-simlab. For example, :class:`zarr.ZipStore` doesn't support
   adding data to a zarr dataset once it has been created.

Regardless of the chosen alternative, :meth:`xarray.Dataset.xsimlab.run` returns
an ``xarray.Dataset`` object that contains the data (lazily) loaded from the zarr
store:

.. ipython:: python

    out_ds

Zarr stores large multi-dimensional arrays as contiguous chunks. When opened as
an ``xarray.Dataset``, xarray keeps track of those chunks, which enables efficient
and parallel post-processing via the dask_ library (see section `parallel
computing with Dask`_ in xarray's docs).
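
As a rough illustration of chunk-wise post-processing, the sketch below builds
a small stand-in dataset by hand (the values and the ``profile__u`` layout are
assumptions, not real simulation output), splits it into one chunk per
snapshot, and computes a lazy reduction. It requires dask to be installed.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a simulation output:
# 3 snapshots of a 1-D profile with 4 grid points.
u = np.arange(12.0).reshape(3, 4)
out_ds = xr.Dataset(
    {"profile__u": (("otime", "x"), u)},
    coords={"otime": [0.0, 0.5, 1.0]},
)

# One dask chunk per snapshot.
chunked = out_ds.chunk({"otime": 1})
chunks = chunked["profile__u"].chunks

# This only builds a task graph; nothing is computed yet.
lazy_mean = chunked["profile__u"].mean(dim="x")

# Trigger the computation; dask may process the chunks in parallel.
mean_per_snapshot = lazy_mean.compute().values
```

With a dataset returned by a model run and backed by a zarr store, the
explicit ``chunk`` call may not be needed, since the chunks are already
tracked by xarray.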

.. _`storage alternatives`: https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives
.. _`parallel computing with Dask`: http://xarray.pydata.org/en/stable/dask.html
.. _dask: https://dask.org/

.. ipython:: python
    :suppress:

    import os
    import shutil
    os.remove("model2_setup.nc")
    os.remove("model2_run.nc")
    shutil.rmtree("model2_run.zarr")

1 change: 1 addition & 0 deletions doc/whats_new.rst
@@ -68,6 +68,7 @@ Enhancements
:class:`~xsimlab.RuntimeHook` class.
- Added some useful properties and methods to the ``xarray.Dataset.xsimlab``
extension (:issue:`103`).
- Save model inputs/outputs using zarr (:issue:`102`).

Bug fixes
~~~~~~~~~
2 changes: 1 addition & 1 deletion setup.py
@@ -21,7 +21,7 @@
packages=find_packages(),
long_description=(open("README.rst").read() if exists("README.rst") else ""),
python_requires=">=3.5",
- install_requires=["attrs >= 18.1.0", "numpy", "xarray >= 0.10.0"],
+ install_requires=["attrs >= 18.1.0", "numpy", "xarray >= 0.10.0", "zarr >= 2.3.0"],
tests_require=["pytest >= 3.3.0"],
zip_safe=False,
)
