Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader #1088

Merged
merged 208 commits into pytroll:master from feature-dynamic-datasetids Aug 4, 2020

Conversation

mraspaud
Member

@mraspaud mraspaud commented Feb 27, 2020

This is an attempt at making DatasetIDs dynamic.

DatasetID objects are used in Satpy to index datasets in a Scene object. They are namedtuples containing a fixed set of properties that we considered a common denominator for all datasets readable by Satpy. However, it has become clear on multiple occasions that keeping all the properties relevant to all the datasets is increasingly difficult. For example, polarization and level each apply to only one dataset type (SAR-C and Geocat respectively). At the same time, more dataset types coming into Satpy need new properties, e.g. SLSTR, whose nadir and oblique views we need to include in the DatasetID if we want to discriminate the datasets efficiently.

To avoid having to extend the set of properties needed to identify all datasets in Satpy, I propose here to make the DatasetID objects dynamic, in that the set of properties they use to identify datasets can be tuned per reader. At the same time, we are still able to compare DatasetIDs from different datasets. Moreover, properties that are irrelevant to a given dataset are now omitted entirely. For example, wavelength doesn't make sense for a longitude dataset, and hence will not be part of the DatasetID needed to identify longitude, even if the reader configuration defines it as a possible property for identifying datasets.

For backwards compatibility, a default DatasetID is still available, but it should hopefully be removed soon.
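Concretely, the idea is that each reader's YAML configuration declares which metadata keys make up the identifier. The snippet below is an illustrative sketch only; the key names, types, and enum values approximate the shape of the final configuration and are not a verbatim copy of any shipped reader:

```yaml
# Illustrative sketch of per-reader identification keys (names approximate).
reader:
  name: slstr_l1b
  data_identification_keys:
    name:
      required: true
    wavelength:
      type: !!python/name:satpy.dataset.dataid.WavelengthRange
    view:
      enum:
        - nadir
        - oblique
    calibration:
      enum:
        - reflectance
        - brightness_temperature
        - radiance
```

A reader that never needs, say, polarization simply leaves it out, and it will never appear in that reader's DataIDs.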

@mraspaud mraspaud added the enhancement, component:dep_tree, refactor, and component:scene labels Feb 27, 2020
@mraspaud mraspaud self-assigned this Feb 27, 2020
@mraspaud
Member Author

PS: In its current shape, I'm expecting this PR to break a lot of stuff. I'm throwing it out here hoping for insightful comments that will show whether I'm going in the right direction before I spend more time on it. I.e., please give some feedback!

@mraspaud mraspaud added this to In progress in PCW Spring 2020 May 4, 2020
@gerritholl
Collaborator

I wonder if this might help make composite requirements more flexible. I have a use case in fogpy where I have defined a composite that needs:

  1. Channel data for 7 channels, which may come from SEVIRI, ABI, FCI, or others.
  2. Cloud microphysics data (3 parameters), which may come from NWCSAF, CLAAS/CMSAF, ABI L2, or others. They don't have the same name in different sources.

Currently, the former can match either by name ("IR_108", "C02") or by wavelength (10.8) (I'm not sure if this is the complete story, as I still don't have a full understanding of how compositors work; see also #714 and #161). The latter can match only by name, I think. With this PR, the cloud microphysics datasets could have a provides: field with a name matching the requires: field in the composite, as an alternative to matching non-channel datasets by name only.
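As a sketch of that idea, purely hypothetical syntax that this PR does not implement (the compositor path and field names below are made up for illustration):

```yaml
# Hypothetical: a composite declares abstract requirements, and reader
# datasets advertise what they satisfy via a matching "provides" field.
composites:
  fog:
    compositor: !!python/name:fogpy.composites.FogCompositor
    requires:
      - cloud_optical_depth
      - cloud_effective_radius
```

The reader-side dataset definitions would then carry e.g. `provides: cloud_optical_depth` regardless of whether the source calls the variable `cot`, `COT`, or something else.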

@djhoese
Member

djhoese commented May 12, 2020

I don't think dynamic datasets would help with this, and I wouldn't want them to evolve to support this functionality in composites. DatasetIDs are identifiers; this requires/provides system is something separate. To fully support it we'd have to add provides to all DatasetIDs from all readers. A similar approach that I could see is putting standard_name in the DatasetID and using that as a way to find something that would serve the purpose needed by the composite. If not that, then perhaps a completely separate interface for this.

@mraspaud
Member Author

I see two benefits off the top of my head:

  1. The list of calibration levels is fixed, and trying to ask for a wrong calibration will automatically fail, independently of the reader.
  2. The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.

As for serialisation, a couple of methods are already implemented; the idea is that the enum is converted back to text on dump and recreated on the fly on load.
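A minimal sketch of that round-trip using the stdlib Enum (the helper name here is illustrative, not Satpy's actual API):

```python
from enum import Enum

# Illustrative helper; the actual Satpy implementation differs in detail.
def make_calibration_enum(levels):
    """Build an ordered Enum from a reader-configured list of calibration names."""
    return Enum("calibration", levels)

cal = make_calibration_enum(["reflectance", "radiance", "counts"])

# The list is ordered: member values follow the configured preference order.
ordered = sorted(cal, key=lambda member: member.value)

# Dump: convert the enum back to text -- only the member name is stored.
serialized = cal["radiance"].name  # "radiance"

# Load: recreate the enum on the fly and look the member up by name.
cal_again = make_calibration_enum(["reflectance", "radiance", "counts"])
restored = cal_again[serialized]
```

Because only the name travels across the serialisation boundary, no Enum object ever needs to be pickled.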

@djhoese
Member

djhoese commented Jul 29, 2020

> The list of calibration levels is fixed

Fixed per reader though, right? What if a reader defines a new calibration that is in the middle of the existing/common calibrations? That means that calibration.reflectance from old code will no longer equal calibration.reflectance in new code. This is especially bad if you are serializing things between versions where the same reader has added a new calibration level in this way. This is something SIFT has run into time and time again because we store pickled enums in a database: if we change the enums by doing anything other than appending, things can no longer be unpickled.
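This failure mode can be reproduced with plain stdlib Enums (the level names below are made up for illustration):

```python
from enum import Enum

# "Old" calibration list for a reader.
CalOld = Enum("calibration", ["counts", "radiance", "reflectance"])
# "New" list with a level inserted in the middle:
CalNew = Enum("calibration", ["counts", "corrected_radiance", "radiance", "reflectance"])

# The auto-assigned value of every member after the insertion point shifts,
# so anything persisted by value (e.g. pickled members) no longer matches.
print(CalOld.reflectance.value, CalNew.reflectance.value)  # 3 4
```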

> The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.

That's cool, but it gets back to what was discussed on Slack: it is hard to determine the ordering/sorting of things in a DataID. What is best: min or max resolution, lower or higher calibration? Having this be explicit rather than the result of some dynamic Enum ordering seems like a more maintainable solution.

> As for serialisation, there are already a couple of methods for it implemented, the idea being that the enum is converted back to text on dump, and recreated on the fly on load.

This isn't useful if it only works inside the DataID class, since the enum also shows up in the .attrs of the DataArray.

@mraspaud
Member Author

mraspaud commented Jul 29, 2020

We can find a way to serialise that doesn't bring along any enum: just the name of the calibration level will be passed (also in the data attrs).

Adding flexibility by making the DataID configurable per reader does indeed make the code more abstract in some places. I'm trying to make this as clear as possible within the current code base, but I understand the code can be harder to follow. Still, I feel it's an improvement over having hardcoded DatasetID keys.

Also, having acceptance tests that verify what SIFT expects from Satpy would be useful to ensure that unacceptable changes don't go unnoticed.

Edit: remove grumpy comments, sorry about those.

@mraspaud
Member Author

> > The list of calibration levels is fixed
>
> Fixed per reader though, right? What if a reader defines a new calibration that is in the middle of the existing/common calibrations? That means that calibration.reflectance from old code will no longer equal calibration.reflectance in new code. This is especially bad if you are serializing things between versions where the same reader has added a new calibration level in this way. This is something SIFT has run into time and time again because we store pickled enums in a database: if we change the enums by doing anything other than appending, things can no longer be unpickled.

This is now addressed by passing only the name of the enum member to the DataArray attrs and when serializing the DataID.

> > The list is ordered, allowing Satpy to sort DataIDs by a predetermined calibration order.
>
> That's cool, but it gets back to what was discussed on Slack: it is hard to determine the ordering/sorting of things in a DataID. What is best: min or max resolution, lower or higher calibration? Having this be explicit rather than the result of some dynamic Enum ordering seems like a more maintainable solution.

It is explicit in the reader configs, where we give the calibration levels in order of preference. I don't see how we can make it clearer without hardcoding all the use cases.

> > As for serialisation, there are already a couple of methods for it implemented, the idea being that the enum is converted back to text on dump, and recreated on the fly on load.
>
> This isn't useful if it only works inside the DataID class, since the enum also shows up in the .attrs of the DataArray.

As mentioned above, this has been addressed in the latest commit.

@mraspaud
Member Author

mraspaud commented Jul 30, 2020

Also, just to clarify why I think the enums aren't so problematic here: the implementation of the dynamic enum in this PR implements __eq__ as checking name equality only (i.e. not the value). So if we define dataid1 with calibration levels [level1, level2, level3] and dataid2 with calibration levels [level1, level1.5, level2, level2.5, level3], with

dataid1 = (name='hi', calibration='level2')
dataid2 = (name='hi', calibration='level2')

they will be considered equal!
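A toy version of that name-only comparison (this is not Satpy's actual class, and the level names are simplified to valid Python identifiers):

```python
from enum import Enum

class NameComparableEnum(Enum):
    """Enum whose members compare by name only, ignoring position-derived values."""
    def __eq__(self, other):
        return isinstance(other, Enum) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

# Two readers configure different calibration lists...
Cal1 = NameComparableEnum("calibration", ["level1", "level2", "level3"])
Cal2 = NameComparableEnum(
    "calibration", ["level1", "level1b", "level2", "level2b", "level3"]
)

# ...yet members with the same name still compare equal,
# even though their auto-assigned values differ (2 vs 3).
print(Cal1.level2 == Cal2.level2)  # True
```

Within one reader the values still give the configured ordering; across readers only the names matter for equality.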

Member

@djhoese djhoese left a comment


As I mentioned on slack:

  1. I have NOT read through the code.
  2. I think it is good to merge. Any additional changes can be moved to future pull requests. There may be too much code to review, and at this point you are the one defining what "good enough" code is anyway.

We've had multiple people test this PR, and I've been using it for a couple of days. Everything we've tested works. Anything we discover from now on will require an additional PR, and we are more likely to find issues if this is in the master branch.

Great job and thanks for taking the time and effort to get this done. I think this will really open the door for future flexibility in Satpy.

One thing: could you update the title so someone reading the changelog will get a good idea/summary of what has changed (maybe mention DataID)?

PCW Spring 2020 automation moved this from In progress to Ready to merge Aug 4, 2020
@mraspaud mraspaud changed the title Make DatasetIDs dynamic Make the metadata keys that uniquely identify a DataArray (DataID) configurable per reader Aug 4, 2020
@mraspaud
Member Author

mraspaud commented Aug 4, 2020

Thanks all for looking at this!

@mraspaud mraspaud merged commit 367d725 into pytroll:master Aug 4, 2020
PCW Spring 2020 automation moved this from Ready to merge to Done Aug 4, 2020
@sfinkens
Member

sfinkens commented Aug 4, 2020

🎉 Great effort! Probably the biggest PR in satpy's history.

@mraspaud mraspaud deleted the feature-dynamic-datasetids branch September 17, 2020 09:48
djhoese pushed a commit to ssec/sift that referenced this pull request May 13, 2022
This is mostly a commit from David Hoese (see
6a380df
- "Add support for Satpy 0.23+") which has been updated to match current
Satpy master, i.e., to take commit "Refactor scene to privatise some
attributes and methods"
pytroll/satpy@e8f2218
into account.

NOTE: From this commit MTG-SIFT doesn't work with Satpy versions <=
0.22 since it adapts to API changes introduced with the Satpy pull
request "Make the metadata keys that uniquely identify a DataArray
(DataID) configurable per reader #1088"
pytroll/satpy#1088

Successfully merging this pull request may close these issues.

Saving true color GOES image requires double-resampling if calibration='radiance'