Merge pull request #1088 from mraspaud/feature-dynamic-datasetids
Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader
mraspaud committed Aug 4, 2020
2 parents f80c568 + 2ac930b commit 367d725
Showing 143 changed files with 3,692 additions and 3,031 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -33,6 +33,9 @@ htmlcov
#Translations
*.mo

#Sphinx
doc/source/_build/*

#Mr Developer
.mr.developer.cfg

1 change: 1 addition & 0 deletions .travis.yml
@@ -65,6 +65,7 @@ install:
script:
- pytest --cov=satpy satpy/tests
- coverage run -a --source=satpy -m behave satpy/tests/features --tags=-download
- if [ "$TRAVIS_EVENT_TYPE" == "cron" ]; then coverage run -a --source=satpy -m behave satpy/tests/features; fi
after_success:
- if [[ $PYTHON_VERSION == 3.8 ]]; then coveralls; codecov; fi
deploy:
10 changes: 9 additions & 1 deletion doc/source/composites.rst
@@ -2,6 +2,14 @@
Composites
==========

Composites are defined as arrays of data that are created by processing and/or
combining one or more other data arrays (the prerequisites).

Composites are generated in Satpy using Compositor classes. The attributes of the
resulting composite are usually a combination of the prerequisites' attributes and
the keys/values of the DataID used to identify the composite.
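
As a quick, hedged illustration, generating a composite directly with a
compositor class could look like the following sketch; the already-loaded
scene `scn` and the band names are assumptions, not taken from a specific
reader:

.. code-block:: python

    from satpy.composites import GenericCompositor

    # Combine three already-loaded bands into an RGB-style composite; the
    # resulting DataArray carries attributes merged from the inputs.
    compositor = GenericCompositor("my_rgb")
    my_rgb = compositor([scn['B04'], scn['B03'], scn['B02']])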


Built-in Compositors
====================

@@ -430,7 +438,7 @@ Enhancing the images
- palettize
- three_d_effect
- btemp_threshold

.. todo::

Should this be in another file/page?
8 changes: 6 additions & 2 deletions doc/source/dev_guide/custom_reader.rst
@@ -122,6 +122,10 @@ The parameters to provide in this section are:
sensors: [seviri]
reader: !!python/name:satpy.readers.yaml_reader.FileYAMLReader
Optionally, if you need to customize the `DataID` for this reader, you can provide the
relevant keys with a `data_identification_keys` item here. See the :doc:`satpy_internals`
section for more information.

.. _custom_reader_file_types_section:

The ``file_types`` section
@@ -476,7 +480,7 @@ needs to implement a few methods:
in the example below.

- the ``get_area_def`` method, that takes as single argument the
-  :class:`~satpy.dataset.DatasetID` for which we want
+  :class:`~satpy.dataset.DataID` for which we want
the area. It should return a :class:`~pyresample.geometry.AreaDefinition`
object. For data that cannot be geolocated with an area
definition, the pixel coordinates will be loaded using the
@@ -539,7 +543,7 @@ One way of implementing a file handler is shown below:
        self.nc = None

    def get_dataset(self, dataset_id, dataset_info):
-       if dataset_id.calibration != 'radiance':
+       if dataset_id['calibration'] != 'radiance':
            # TODO: implement calibration to reflectance or brightness temperature
            return
        if self.nc is None:
1 change: 1 addition & 0 deletions doc/source/dev_guide/index.rst
@@ -15,6 +15,7 @@ at the pages listed below.
xarray_migration
custom_reader
plugins
satpy_internals

Coding guidelines
=================
157 changes: 157 additions & 0 deletions doc/source/dev_guide/satpy_internals.rst
@@ -0,0 +1,157 @@
======================================================
Satpy internal workings: having a look under the hood
======================================================

Querying and identifying data arrays
====================================

DataQuery
---------

The loading of data in Satpy is usually done by giving the name or the wavelength of the data arrays we are interested
in. This way, the version of the data array with the highest resolution and the most refined calibration is returned.

However, in some cases, we need more control over the loading of the data arrays. The way to accomplish this is to load
data arrays using queries, e.g.::

    scn.load([DataQuery(name='channel1', resolution=400)])

Here, a data array named `channel1` with a resolution of `400` will be loaded if available.

Note that `None` is not a valid value, and keys having a value set to `None` will simply be ignored.

If one wants to use wildcards to query data, just provide `'*'`, e.g.::

    scn.load([DataQuery(name='channel1', resolution=400, calibration='*')])

Alternatively, one can provide a list as a parameter value to query data, like this::

    scn.load([DataQuery(name='channel1', resolution=[400, 800])])



DataID
------

Satpy stores loaded data arrays in a special dictionary (`DatasetDict`) inside scene objects.
In order to identify each data array uniquely, Satpy assigns an ID to each data array, which is then used as the key in
the scene object. These IDs are of type `DataID` and are immutable. They are not supposed to be used by regular users and should only be
created in special circumstances; Satpy takes care of creating and assigning them automatically. They are also stored in the
`attrs` of each data array as `_satpy_id`.
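
As a small sketch of how this surfaces in practice (`my_files`, the reader
name and the channel are placeholders):

.. code-block:: python

    from satpy import Scene

    scn = Scene(filenames=my_files, reader='seviri_l1b_hrit')  # hypothetical inputs
    scn.load(['IR_108'])
    data_arr = scn['IR_108']
    data_id = data_arr.attrs['_satpy_id']  # the DataID Satpy assigned
    assert scn[data_id] is data_arr        # the same DataID keys the scene dict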

Default and custom metadata keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One thing, however, that the user has control over is which metadata keys are relevant to which datasets. Satpy provides two default sets
of metadata keys (or ID keys): one for regular imager bands, and the other for composites.
The first one contains: name, wavelength, resolution, calibration, modifiers.
The second one contains: name, resolution.

As an example, here is the definition of the first one in yaml:

.. code-block:: yaml

    data_identification_keys:
      name:
        required: true
      wavelength:
        type: !!python/name:satpy.dataset.WavelengthRange
      resolution:
        transitive: true
      calibration:
        enum:
          - reflectance
          - brightness_temperature
          - radiance
          - counts
      modifiers:
        required: true
        default: []
        type: !!python/name:satpy.dataset.ModifierTuple

To create a new set, the user can provide indications in the relevant yaml file.
They have to be provided in the header of the reader configuration file, under the `reader`
section, as `data_identification_keys`. Each key under this is the name of a relevant
metadata key that will be used to find the relevant information in the attributes of the data
arrays. Under each of these, a few options are available (a sketch of a full reader header follows the list):

- `required`: if the item is required, False by default
- `type`: the type to use. More on this further down.
- `enum`: if the item has to be limited to a finite number of options, an enum can be used.
Be sure to place the options in the order of preference, with the most desirable option on top.
- `default`: the default value to assign to the item if nothing (or None) is provided. If this
  option isn't provided, the key will simply be omitted if it is not present in the attrs or if it
  is None. The default value will be passed to the type's `convert` method if available.
- `transitive`: whether the key is to be passed when looking for dependencies. Here for example,
a composite that has to be at a certain resolution will pass this resolution requirement to its
dependencies.
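
For instance, a reader header providing a custom key set might look like this
sketch (the reader and sensor names are made up for illustration):

.. code-block:: yaml

    reader:
      name: my_reader
      sensors: [my_sensor]
      reader: !!python/name:satpy.readers.yaml_reader.FileYAMLReader
      data_identification_keys:
        name:
          required: true
        resolution:
          transitive: true
        calibration:
          enum:
            - reflectance
            - radiance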


If the definition of the metadata keys needs to be done in python rather than in a yaml file, it will
be a dictionary very similar to the yaml code. Here is the same example as above in python:

.. code-block:: python

    from satpy.dataset import WavelengthRange, ModifierTuple

    id_keys_config = {'name': {
                          'required': True,
                      },
                      'wavelength': {
                          'type': WavelengthRange,
                      },
                      'resolution': None,
                      'calibration': {
                          'enum': [
                              'reflectance',
                              'brightness_temperature',
                              'radiance',
                              'counts'
                          ]
                      },
                      'modifiers': {
                          'required': True,
                          'default': ModifierTuple(),
                          'type': ModifierTuple,
                      },
                      }

Types
~~~~~
Types are classes that implement a type to be used as a value for metadata in the `DataID`. They have
to implement a few methods:

- a `convert` class method that returns its argument as an instance of the class
- `__hash__`, `__eq__` and `__ne__` methods
- a `distance` method that tells how "far" an instance of this class is from its argument.

An example of such a class is the :class:`WavelengthRange <satpy.dataset.WavelengthRange>` class.
Its implementation allows us, for example, to use the wavelength in a query to find out which
`DataID` in a list has its central wavelength closest to that query.
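
As a minimal, hypothetical sketch of such a type (here `__hash__`, `__eq__`
and `__ne__` are simply inherited from `int`):

.. code-block:: python

    import numbers


    class BandNumber(int):
        """Hypothetical DataID value type: a band identified by its number."""

        @classmethod
        def convert(cls, value):
            """Return *value* as a BandNumber instance."""
            return cls(value)

        def distance(self, value):
            """Tell how 'far' *value* is from this band number."""
            if isinstance(value, numbers.Number):
                return abs(int(self) - value)
            return float('inf')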


DataID and DataQuery interactions
=================================

Different `DataID` and `DataQuery` instances can have different metadata items defined. As such,
equality between different instances of these classes, and across the classes, is defined
as equality between the sorted key/value pairs shared between the instances.
If a `DataQuery` has one or more values set to `'*'`, the corresponding key/value pairs will be omitted from the comparison.
Instances sharing no keys will not be equal.
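
If these rules hold, the following sketch (not taken from the test suite)
should illustrate them:

.. code-block:: python

    from satpy import DataQuery

    q1 = DataQuery(name='channel1', resolution=400)
    q2 = DataQuery(name='channel1', resolution=400, calibration='*')
    # calibration is '*' and is thus omitted; the shared name/resolution
    # pairs match, so the two queries should compare equal.
    assert q1 == q2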


Breaking changes from DatasetIDs
================================

- The way to access values from the `DataID` and `DataQuery` is through getitem: `my_dataid['resolution']`
- For checking if a dataset is loaded, use `'mydataset' in scene`, as `'mydataset' in scene.keys()` will always return `False`:
  the `DatasetDict` instance only supports `DataID` as key type (see the sketch below).
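
A short sketch of both changes, assuming a loaded `scene` and a `DataID`
instance `my_dataid`:

.. code-block:: python

    # Attribute access (my_dataid.resolution) is gone; use getitem instead:
    resolution = my_dataid['resolution']

    # Containment checks go through the scene itself:
    'mydataset' in scene         # True if the dataset is loaded
    'mydataset' in scene.keys()  # always False: the keys are DataID objects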

Creating DataID for tests
=========================

Sometimes, it is useful to create `DataID` instances for testing purposes. For these cases, the `satpy.tests.utils` module
now has a `make_dataid` function that can be used just for this::

    from satpy.tests.utils import make_dataid
    did = make_dataid(name='camembert', modifiers=('runny',))
4 changes: 2 additions & 2 deletions doc/source/multiscene.rst
@@ -110,9 +110,9 @@ roughly the same time. First, create scenes and load datasets individually:

Now create a ``MultiScene`` and group the three similar IR channels together:

->>> from satpy import MultiScene, DatasetID
+>>> from satpy import MultiScene, DataQuery
>>> mscn = MultiScene([h8_scene, g16_scene, met10_scene])
->>> groups = {DatasetID('IR_group', wavelength=(10, 11, 12)): ['B13', 'C13', 'IR_108']}
+>>> groups = {DataQuery('IR_group', wavelength=(10, 11, 12)): ['B13', 'C13', 'IR_108']}
>>> mscn.group(groups)

Finally, resample the datasets to a common grid and blend them together:
9 changes: 5 additions & 4 deletions doc/source/overview.rst
@@ -48,11 +48,12 @@ For help on developing with dask and xarray see
:doc:`dev_guide/xarray_migration` or the documentation for the specific
project.

-To uniquely identify ``DataArray`` objects Satpy uses `DatasetID`. A
-``DatasetID`` consists of various pieces of available metadata. This usually
-includes `name` and `wavelength` as identifying metadata, but also includes
+To uniquely identify ``DataArray`` objects Satpy uses `DataID`. A
+``DataID`` consists of various pieces of available metadata. This usually
+includes `name` and `wavelength` as identifying metadata, but can also include
`resolution`, `calibration`, `polarization`, and additional `modifiers`
-to further distinguish one dataset from another.
+to further distinguish one dataset from another. For more information on `DataID`
+objects, have a look at :doc:`dev_guide/satpy_internals`.

.. warning::

11 changes: 5 additions & 6 deletions doc/source/readers.rst
@@ -47,10 +47,10 @@ to them. By default Satpy will provide the version of the dataset with the
highest resolution and the highest level of calibration (brightness
temperature or reflectance over radiance). It is also possible to request one
of these exact versions of a dataset by using the
-:class:`~satpy.dataset.DatasetID` class::
+:class:`~satpy.dataset.DataQuery` class::

->>> from satpy import DatasetID
->>> my_channel_id = DatasetID(name='IR_016', calibration='radiance')
+>>> from satpy import DataQuery
+>>> my_channel_id = DataQuery(name='IR_016', calibration='radiance')
>>> scn.load([my_channel_id])
>>> print(scn['IR_016'])

@@ -93,7 +93,7 @@ load the datasets using e.g.::
If a dataset could not be loaded there is no exception raised. You must
check the
:meth:`scn.missing_datasets <satpy.scene.Scene.missing_datasets>`
-property for any ``DatasetID`` that could not be loaded.
+property for any ``DataID`` that could not be loaded.

To find out what datasets are available from a reader from the files that were
provided to the ``Scene`` use
@@ -137,8 +137,7 @@ Metadata
The datasets held by a scene also provide vital metadata such as dataset name, units, observation
time etc. The following attributes are standardized across all readers:

-* ``name``, ``wavelength``, ``resolution``, ``polarization``, ``calibration``, ``level``,
-  ``modifiers``: See :class:`satpy.dataset.DatasetID`.
+* ``name``, and other identifying metadata keys: See :doc:`dev_guide/satpy_internals`.
* ``start_time``: Left boundary of the time interval covered by the dataset.
* ``end_time``: Right boundary of the time interval covered by the dataset.
* ``area``: :class:`~pyresample.geometry.AreaDefinition` or
5 changes: 2 additions & 3 deletions satpy/__init__.py
@@ -15,8 +15,7 @@
#
# You should have received a copy of the GNU General Public License along with
# satpy. If not, see <http://www.gnu.org/licenses/>.
"""Satpy Package initializer.
"""
"""Satpy Package initializer."""

import os
from pkg_resources import get_distribution, DistributionNotFound
@@ -47,7 +46,7 @@
CALIBRATION_ORDER = {cal: idx for idx, cal in enumerate(CALIBRATION_ORDER)}

from satpy.utils import get_logger # noqa
-from satpy.dataset import DatasetID, DATASET_KEYS # noqa
+from satpy.dataset import DataID, DataQuery # noqa
from satpy.readers import (DatasetDict, find_files_and_readers, # noqa
available_readers) # noqa
from satpy.writers import available_writers # noqa
