Merge pull request #1088 from mraspaud/feature-dynamic-datasetids
Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader
mraspaud committed Aug 4, 2020
2 parents f80c568 + 2ac930b commit 367d725
Showing 143 changed files with 3,692 additions and 3,031 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -33,6 +33,9 @@ htmlcov
#Translations
*.mo

#Sphinx
doc/source/_build/*

#Mr Developer
.mr.developer.cfg

1 change: 1 addition & 0 deletions .travis.yml
@@ -65,6 +65,7 @@ install:
script:
- pytest --cov=satpy satpy/tests
- coverage run -a --source=satpy -m behave satpy/tests/features --tags=-download
- if [ "$TRAVIS_EVENT_TYPE" == "cron" ]; then coverage run -a --source=satpy -m behave satpy/tests/features; fi
after_success:
- if [[ $PYTHON_VERSION == 3.8 ]]; then coveralls; codecov; fi
deploy:
10 changes: 9 additions & 1 deletion doc/source/composites.rst
@@ -2,6 +2,14 @@
Composites
==========

Composites are defined as arrays of data that are created by processing and/or
combining one or more other data arrays (the prerequisites).

Composites are generated in Satpy using Compositor classes. The attributes of the
resulting composite are usually a combination of the prerequisites' attributes and
the keys/values of the DataID used to identify the composite.
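
As a quick, hedged illustration, generating a composite directly with a
compositor class could look like the following sketch; the already-loaded
scene `scn` and the band names are assumptions, not taken from a specific
reader:

.. code-block:: python

    from satpy.composites import GenericCompositor

    # Combine three already-loaded bands into an RGB-style composite; the
    # resulting DataArray carries attributes merged from the inputs.
    compositor = GenericCompositor("my_rgb")
    my_rgb = compositor([scn['B04'], scn['B03'], scn['B02']])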


Built-in Compositors
====================

@@ -430,7 +438,7 @@ Enhancing the images
- palettize
- three_d_effect
- btemp_threshold

.. todo::

Should this be in another file/page?
8 changes: 6 additions & 2 deletions doc/source/dev_guide/custom_reader.rst
@@ -122,6 +122,10 @@ The parameters to provide in this section are:
sensors: [seviri]
reader: !!python/name:satpy.readers.yaml_reader.FileYAMLReader
Optionally, if you need to customize the `DataID` for this reader, you can provide the
relevant keys with a `data_identification_keys` item here. See the :doc:`satpy_internals`
section for more information.

.. _custom_reader_file_types_section:

The ``file_types`` section
@@ -476,7 +480,7 @@ needs to implement a few methods:
in the example below.

- the ``get_area_def`` method, that takes as single argument the
-  :class:`~satpy.dataset.DatasetID` for which we want
+  :class:`~satpy.dataset.DataID` for which we want
the area. It should return a :class:`~pyresample.geometry.AreaDefinition`
object. For data that cannot be geolocated with an area
definition, the pixel coordinates will be loaded using the
@@ -539,7 +543,7 @@ One way of implementing a file handler is shown below:
        self.nc = None

    def get_dataset(self, dataset_id, dataset_info):
-       if dataset_id.calibration != 'radiance':
+       if dataset_id['calibration'] != 'radiance':
            # TODO: implement calibration to reflectance or brightness temperature
            return
        if self.nc is None:
1 change: 1 addition & 0 deletions doc/source/dev_guide/index.rst
@@ -15,6 +15,7 @@ at the pages listed below.
xarray_migration
custom_reader
plugins
satpy_internals

Coding guidelines
=================
157 changes: 157 additions & 0 deletions doc/source/dev_guide/satpy_internals.rst
@@ -0,0 +1,157 @@
======================================================
Satpy internal workings: having a look under the hood
======================================================

Querying and identifying data arrays
====================================

DataQuery
---------

The loading of data in Satpy is usually done by giving the name or the wavelength of the data arrays we are interested
in. This way, the version of the data array with the highest resolution and the most refined calibration is returned.

However, in some cases, we need more control over the loading of the data arrays. The way to accomplish this is to load
data arrays using queries, e.g.::

    scn.load([DataQuery(name='channel1', resolution=400)])

Here, a data array named `channel1` with a resolution of `400` will be loaded if available.

Note that `None` is not a valid value, and keys having a value set to `None` will simply be ignored.

If one wants to use wildcards to query data, just provide `'*'`, e.g.::

    scn.load([DataQuery(name='channel1', resolution=400, calibration='*')])

Alternatively, one can provide a list as a parameter value to query data, like this::

    scn.load([DataQuery(name='channel1', resolution=[400, 800])])



DataID
------

Satpy stores loaded data arrays in a special dictionary (`DatasetDict`) inside scene objects.
In order to identify each data array uniquely, Satpy assigns an ID to each data array, which is then used as the key in
the scene object. These IDs are of type `DataID` and are immutable. They are not supposed to be used by regular users and should only be
created in special circumstances; Satpy takes care of creating and assigning them automatically. They are also stored in the
`attrs` of each data array as `_satpy_id`.
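
As a small sketch of how this surfaces in practice (`my_files`, the reader
name and the channel are placeholders):

.. code-block:: python

    from satpy import Scene

    scn = Scene(filenames=my_files, reader='seviri_l1b_hrit')  # hypothetical inputs
    scn.load(['IR_108'])
    data_arr = scn['IR_108']
    data_id = data_arr.attrs['_satpy_id']  # the DataID Satpy assigned
    assert scn[data_id] is data_arr        # the same DataID keys the scene dict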

Default and custom metadata keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One thing, however, that the user has control over is which metadata keys are relevant to which datasets. Satpy provides two default sets
of metadata keys (or ID keys): one for regular imager bands, and the other for composites.
The first one contains: name, wavelength, resolution, calibration, modifiers.
The second one contains: name, resolution.

As an example, here is the definition of the first one in yaml:

.. code-block:: yaml

    data_identification_keys:
      name:
        required: true
      wavelength:
        type: !!python/name:satpy.dataset.WavelengthRange
      resolution:
        transitive: true
      calibration:
        enum:
          - reflectance
          - brightness_temperature
          - radiance
          - counts
      modifiers:
        required: true
        default: []
        type: !!python/name:satpy.dataset.ModifierTuple

To create a new set, the user can provide indications in the relevant yaml file.
They have to be provided in the header of the reader configuration file, under the `reader`
section, as `data_identification_keys`. Each key under this is the name of a relevant
metadata key that will be used to find the relevant information in the attributes of the data
arrays. Under each of these, a few options are available (a sketch of a full reader header follows the list):

- `required`: if the item is required, False by default
- `type`: the type to use. More on this further down.
- `enum`: if the item has to be limited to a finite number of options, an enum can be used.
Be sure to place the options in the order of preference, with the most desirable option on top.
- `default`: the default value to assign to the item if nothing (or None) is provided. If this
  option isn't provided, the key will simply be omitted if it is not present in the attrs or if it
  is None. The default value will be passed to the type's `convert` method if available.
- `transitive`: whether the key is to be passed when looking for dependencies. Here for example,
a composite that has to be at a certain resolution will pass this resolution requirement to its
dependencies.
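
For instance, a reader header providing a custom key set might look like this
sketch (the reader and sensor names are made up for illustration):

.. code-block:: yaml

    reader:
      name: my_reader
      sensors: [my_sensor]
      reader: !!python/name:satpy.readers.yaml_reader.FileYAMLReader
      data_identification_keys:
        name:
          required: true
        resolution:
          transitive: true
        calibration:
          enum:
            - reflectance
            - radiance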


If the definition of the metadata keys needs to be done in python rather than in a yaml file, it will
be a dictionary very similar to the yaml code. Here is the same example as above in python:

.. code-block:: python

    from satpy.dataset import WavelengthRange, ModifierTuple

    id_keys_config = {'name': {
                          'required': True,
                      },
                      'wavelength': {
                          'type': WavelengthRange,
                      },
                      'resolution': None,
                      'calibration': {
                          'enum': [
                              'reflectance',
                              'brightness_temperature',
                              'radiance',
                              'counts'
                          ]
                      },
                      'modifiers': {
                          'required': True,
                          'default': ModifierTuple(),
                          'type': ModifierTuple,
                      },
                      }

Types
~~~~~
Types are classes that implement a type to be used as a value for metadata in the `DataID`. They have
to implement a few methods:

- a `convert` class method that returns its argument as an instance of the class
- `__hash__`, `__eq__` and `__ne__` methods
- a `distance` method that tells how "far" an instance of this class is from its argument.

An example of such a class is the :class:`WavelengthRange <satpy.dataset.WavelengthRange>` class.
Its implementation allows us, for example, to use the wavelength in a query to find out which
`DataID` in a list has its central wavelength closest to that query.
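
As a minimal, hypothetical sketch of such a type (here `__hash__`, `__eq__`
and `__ne__` are simply inherited from `int`):

.. code-block:: python

    import numbers


    class BandNumber(int):
        """Hypothetical DataID value type: a band identified by its number."""

        @classmethod
        def convert(cls, value):
            """Return *value* as a BandNumber instance."""
            return cls(value)

        def distance(self, value):
            """Tell how 'far' *value* is from this band number."""
            if isinstance(value, numbers.Number):
                return abs(int(self) - value)
            return float('inf')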


DataID and DataQuery interactions
=================================

Different `DataID` and `DataQuery` instances can have different metadata items defined. As such,
equality between different instances of these classes, and across the classes, is defined
as equality between the sorted key/value pairs shared between the instances.
If a `DataQuery` has one or more values set to `'*'`, the corresponding key/value pairs will be omitted from the comparison.
Instances sharing no keys will not be equal.
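
If these rules hold, the following sketch (not taken from the test suite)
should illustrate them:

.. code-block:: python

    from satpy import DataQuery

    q1 = DataQuery(name='channel1', resolution=400)
    q2 = DataQuery(name='channel1', resolution=400, calibration='*')
    # calibration is '*' and is thus omitted; the shared name/resolution
    # pairs match, so the two queries should compare equal.
    assert q1 == q2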


Breaking changes from DatasetIDs
================================

- The way to access values from the `DataID` and `DataQuery` is through getitem: `my_dataid['resolution']`
- For checking if a dataset is loaded, use `'mydataset' in scene`, as `'mydataset' in scene.keys()` will always return `False`:
  the `DatasetDict` instance only supports `DataID` as key type (see the sketch below).
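
A short sketch of both changes, assuming a loaded `scene` and a `DataID`
instance `my_dataid`:

.. code-block:: python

    # Attribute access (my_dataid.resolution) is gone; use getitem instead:
    resolution = my_dataid['resolution']

    # Containment checks go through the scene itself:
    'mydataset' in scene         # True if the dataset is loaded
    'mydataset' in scene.keys()  # always False: the keys are DataID objects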

Creating DataID for tests
=========================

Sometimes, it is useful to create `DataID` instances for testing purposes. For these cases, the `satpy.tests.utils` module
now has a `make_dataid` function that can be used just for this::

    from satpy.tests.utils import make_dataid
    did = make_dataid(name='camembert', modifiers=('runny',))
4 changes: 2 additions & 2 deletions doc/source/multiscene.rst
@@ -110,9 +110,9 @@ roughly the same time. First, create scenes and load datasets individually:

Now create a ``MultiScene`` and group the three similar IR channels together:

->>> from satpy import MultiScene, DatasetID
+>>> from satpy import MultiScene, DataQuery
>>> mscn = MultiScene([h8_scene, g16_scene, met10_scene])
->>> groups = {DatasetID('IR_group', wavelength=(10, 11, 12)): ['B13', 'C13', 'IR_108']}
+>>> groups = {DataQuery('IR_group', wavelength=(10, 11, 12)): ['B13', 'C13', 'IR_108']}
>>> mscn.group(groups)

Finally, resample the datasets to a common grid and blend them together:
9 changes: 5 additions & 4 deletions doc/source/overview.rst
@@ -48,11 +48,12 @@ For help on developing with dask and xarray see
:doc:`dev_guide/xarray_migration` or the documentation for the specific
project.

-To uniquely identify ``DataArray`` objects Satpy uses `DatasetID`. A
-``DatasetID`` consists of various pieces of available metadata. This usually
-includes `name` and `wavelength` as identifying metadata, but also includes
+To uniquely identify ``DataArray`` objects Satpy uses `DataID`. A
+``DataID`` consists of various pieces of available metadata. This usually
+includes `name` and `wavelength` as identifying metadata, but can also include
`resolution`, `calibration`, `polarization`, and additional `modifiers`
-to further distinguish one dataset from another.
+to further distinguish one dataset from another. For more information on `DataID`
+objects, have a look at :doc:`dev_guide/satpy_internals`.

.. warning::

11 changes: 5 additions & 6 deletions doc/source/readers.rst
@@ -47,10 +47,10 @@ to them. By default Satpy will provide the version of the dataset with the
highest resolution and the highest level of calibration (brightness
temperature or reflectance over radiance). It is also possible to request one
of these exact versions of a dataset by using the
-:class:`~satpy.dataset.DatasetID` class::
+:class:`~satpy.dataset.DataQuery` class::

->>> from satpy import DatasetID
->>> my_channel_id = DatasetID(name='IR_016', calibration='radiance')
+>>> from satpy import DataQuery
+>>> my_channel_id = DataQuery(name='IR_016', calibration='radiance')
>>> scn.load([my_channel_id])
>>> print(scn['IR_016'])

@@ -93,7 +93,7 @@ load the datasets using e.g.::
If a dataset could not be loaded there is no exception raised. You must
check the
:meth:`scn.missing_datasets <satpy.scene.Scene.missing_datasets>`
-property for any ``DatasetID`` that could not be loaded.
+property for any ``DataID`` that could not be loaded.

To find out what datasets are available from a reader from the files that were
provided to the ``Scene`` use
@@ -137,8 +137,7 @@ Metadata
The datasets held by a scene also provide vital metadata such as dataset name, units, observation
time etc. The following attributes are standardized across all readers:

-* ``name``, ``wavelength``, ``resolution``, ``polarization``, ``calibration``, ``level``,
-  ``modifiers``: See :class:`satpy.dataset.DatasetID`.
+* ``name``, and other identifying metadata keys: See :doc:`dev_guide/satpy_internals`.
* ``start_time``: Left boundary of the time interval covered by the dataset.
* ``end_time``: Right boundary of the time interval covered by the dataset.
* ``area``: :class:`~pyresample.geometry.AreaDefinition` or
5 changes: 2 additions & 3 deletions satpy/__init__.py
@@ -15,8 +15,7 @@
#
# You should have received a copy of the GNU General Public License along with
# satpy. If not, see <http://www.gnu.org/licenses/>.
"""Satpy Package initializer.
"""
"""Satpy Package initializer."""

import os
from pkg_resources import get_distribution, DistributionNotFound
@@ -47,7 +46,7 @@
CALIBRATION_ORDER = {cal: idx for idx, cal in enumerate(CALIBRATION_ORDER)}

from satpy.utils import get_logger # noqa
-from satpy.dataset import DatasetID, DATASET_KEYS # noqa
+from satpy.dataset import DataID, DataQuery # noqa
from satpy.readers import (DatasetDict, find_files_and_readers, # noqa
available_readers) # noqa
from satpy.writers import available_writers # noqa
